{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![MLU Logo](../data/MLU_Logo.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# <a name=\"0\">Machine Learning Accelerator - Natural Language Processing - Lecture 2</a>\n",
    "\n",
    "## Tree-Based Models for a Classification Problem, and Hyperparameter Tuning\n",
    "\n",
    "We continue to work with our review dataset to see how Tree-based classifiers (Decision Tree, Random Forest), along with efficient optimization techniques (GridSearch, RandomizedSearch), perform to predict the __isPositive__ field of our review dataset (that is very similar to the final project dataset).\n",
    "\n",
    "1. <a href=\"#1\">Reading the dataset</a>\n",
    "2. <a href=\"#2\">Exploratory data analysis</a>\n",
    "3. <a href=\"#3\">Stop word removal and stemming</a>\n",
    "4. <a href=\"#4\">Train - Validation Split</a>\n",
    "5. <a href=\"#5\">Data processing with Pipeline and ColumnTransform</a>\n",
    "6. <a href=\"#6\">Fit the Decision Tree classifier</a> Find more details on the __DecisionTreeClassifier__ here: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html \n",
    "7. <a href=\"#7\">Test the classifier</a>\n",
    "8. <a href=\"#8\">Fit and test the Random Forest classifier</a> Find more details on the __RandomForestClassifier__ here: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html\n",
    "9. <a href=\"#9\">Hyperparameter Tuning</a>\n",
    "    * Find more details on the __GridSearchCV__ here: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html\n",
    "    * Find more details on the __RandomizedSearchCV__ here: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html\n",
    "10. <a href=\"#10\">Ideas for improvement</a>\n",
    "\n",
    "Overall dataset schema:\n",
    "* __reviewText:__ Text of the review\n",
    "* __summary:__ Summary of the review\n",
    "* __verified:__ Whether the purchase was verified (True or False)\n",
    "* __time:__ UNIX timestamp for the review\n",
    "* __log_votes:__ Logarithm-adjusted votes log(1+votes)\n",
    "* __isPositive:__ Whether the review is positive or negative (1 or 0)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. <a name=\"1\">Reading the dataset</a>\n",
    "(<a href=\"#0\">Go to top</a>)\n",
    "\n",
    "We will use the __pandas__ library to read our dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The shape of the dataset is: (70000, 6)\n"
     ]
    }
   ],
   "source": [
    "import pandas as pd\n",
    "\n",
    "df = pd.read_csv('../data/examples/AMAZON-REVIEW-DATA-CLASSIFICATION.csv')\n",
    "\n",
    "print('The shape of the dataset is:', df.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's look at the first 10 rows of the dataset. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>reviewText</th>\n",
       "      <th>summary</th>\n",
       "      <th>verified</th>\n",
       "      <th>time</th>\n",
       "      <th>log_votes</th>\n",
       "      <th>isPositive</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>PURCHASED FOR YOUNGSTER WHO\\nINHERITED MY \"TOO...</td>\n",
       "      <td>IDEAL FOR BEGINNER!</td>\n",
       "      <td>True</td>\n",
       "      <td>1361836800</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>unable to open or use</td>\n",
       "      <td>Two Stars</td>\n",
       "      <td>True</td>\n",
       "      <td>1452643200</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Waste of money!!! It wouldn't load to my system.</td>\n",
       "      <td>Dont buy it!</td>\n",
       "      <td>True</td>\n",
       "      <td>1433289600</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>I attempted to install this OS on two differen...</td>\n",
       "      <td>I attempted to install this OS on two differen...</td>\n",
       "      <td>True</td>\n",
       "      <td>1518912000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>I've spent 14 fruitless hours over the past tw...</td>\n",
       "      <td>Do NOT Download.</td>\n",
       "      <td>True</td>\n",
       "      <td>1441929600</td>\n",
       "      <td>1.098612</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>I purchased the home and business because I wa...</td>\n",
       "      <td>Quicken home and business  not for amatures</td>\n",
       "      <td>True</td>\n",
       "      <td>1335312000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>The download doesn't take long at all. And it'...</td>\n",
       "      <td>Great!</td>\n",
       "      <td>True</td>\n",
       "      <td>1377993600</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>This program is positively wonderful for word ...</td>\n",
       "      <td>Terrific for practice.</td>\n",
       "      <td>False</td>\n",
       "      <td>1158364800</td>\n",
       "      <td>2.397895</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>Fantastic protection!!  Great customer support!!</td>\n",
       "      <td>Five Stars</td>\n",
       "      <td>True</td>\n",
       "      <td>1478476800</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>Obviously Win 7 now the last great operating s...</td>\n",
       "      <td>Five Stars</td>\n",
       "      <td>True</td>\n",
       "      <td>1471478400</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                          reviewText  \\\n",
       "0  PURCHASED FOR YOUNGSTER WHO\\nINHERITED MY \"TOO...   \n",
       "1                              unable to open or use   \n",
       "2   Waste of money!!! It wouldn't load to my system.   \n",
       "3  I attempted to install this OS on two differen...   \n",
       "4  I've spent 14 fruitless hours over the past tw...   \n",
       "5  I purchased the home and business because I wa...   \n",
       "6  The download doesn't take long at all. And it'...   \n",
       "7  This program is positively wonderful for word ...   \n",
       "8   Fantastic protection!!  Great customer support!!   \n",
       "9  Obviously Win 7 now the last great operating s...   \n",
       "\n",
       "                                             summary  verified        time  \\\n",
       "0                                IDEAL FOR BEGINNER!      True  1361836800   \n",
       "1                                          Two Stars      True  1452643200   \n",
       "2                                       Dont buy it!      True  1433289600   \n",
       "3  I attempted to install this OS on two differen...      True  1518912000   \n",
       "4                                   Do NOT Download.      True  1441929600   \n",
       "5        Quicken home and business  not for amatures      True  1335312000   \n",
       "6                                             Great!      True  1377993600   \n",
       "7                             Terrific for practice.     False  1158364800   \n",
       "8                                         Five Stars      True  1478476800   \n",
       "9                                         Five Stars      True  1471478400   \n",
       "\n",
       "   log_votes  isPositive  \n",
       "0   0.000000         1.0  \n",
       "1   0.000000         0.0  \n",
       "2   0.000000         0.0  \n",
       "3   0.000000         0.0  \n",
       "4   1.098612         0.0  \n",
       "5   0.000000         0.0  \n",
       "6   0.000000         1.0  \n",
       "7   2.397895         1.0  \n",
       "8   0.000000         1.0  \n",
       "9   0.000000         1.0  "
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.head(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. <a name=\"2\">Exploratory data analysis</a>\n",
    "(<a href=\"#0\">Go to top</a>)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's look at the range and distribution of log_votes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "isPositive\n",
       "1.0    43692\n",
       "0.0    26308\n",
       "Name: count, dtype: int64"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df[\"isPositive\"].value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "reviewText    12\n",
      "summary       15\n",
      "verified       0\n",
      "time           0\n",
      "log_votes      0\n",
      "isPositive     0\n",
      "dtype: int64\n"
     ]
    }
   ],
   "source": [
    "print(df.isna().sum())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can check the number of missing values for each columm below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "reviewText    12\n",
      "summary       15\n",
      "verified       0\n",
      "time           0\n",
      "log_votes      0\n",
      "isPositive     0\n",
      "dtype: int64\n"
     ]
    }
   ],
   "source": [
    "print(df.isna().sum())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We have missing values in our text fields."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. <a name=\"3\">Text Processing: Stop words removal and stemming</a>\n",
    "(<a href=\"#0\">Go to top</a>)\n",
    "\n",
    "We will apply the text processing methods discussed in the class. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[nltk_data] Downloading package punkt to /home/studio-lab-\n",
      "[nltk_data]     user/nltk_data...\n",
      "[nltk_data]   Package punkt is already up-to-date!\n",
      "[nltk_data] Downloading package stopwords to /home/studio-lab-\n",
      "[nltk_data]     user/nltk_data...\n",
      "[nltk_data]   Package stopwords is already up-to-date!\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Install the library and functions\n",
    "import nltk\n",
    "\n",
    "nltk.download('punkt')\n",
    "nltk.download('stopwords')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will create the stop word removal and text cleaning processes below. NLTK library provides a list of common stop words. We will use the list, but remove some of the words from that list (because those words are actually useful to understand the sentiment in the sentence)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "import nltk, re\n",
    "from nltk.corpus import stopwords\n",
    "from nltk.stem import SnowballStemmer\n",
    "from nltk.tokenize import word_tokenize\n",
    "\n",
    "# Let's get a list of stop words from the NLTK library\n",
    "stop = stopwords.words('english')\n",
    "\n",
    "# These words are important for our problem. We don't want to remove them.\n",
    "excluding = ['against', 'not', 'don', \"don't\",'ain', 'aren', \"aren't\", 'couldn', \"couldn't\",\n",
    "             'didn', \"didn't\", 'doesn', \"doesn't\", 'hadn', \"hadn't\", 'hasn', \"hasn't\", \n",
    "             'haven', \"haven't\", 'isn', \"isn't\", 'mightn', \"mightn't\", 'mustn', \"mustn't\",\n",
    "             'needn', \"needn't\",'shouldn', \"shouldn't\", 'wasn', \"wasn't\", 'weren', \n",
    "             \"weren't\", 'won', \"won't\", 'wouldn', \"wouldn't\"]\n",
    "\n",
    "# New stop word list\n",
    "stop_words = [word for word in stop if word not in excluding]\n",
    "\n",
    "snow = SnowballStemmer('english')\n",
    "\n",
    "def process_text(texts): \n",
    "    final_text_list=[]\n",
    "    for sent in texts:\n",
    "        \n",
    "        # Check if the sentence is a missing value\n",
    "        if isinstance(sent, str) == False:\n",
    "            sent = \"\"\n",
    "            \n",
    "        filtered_sentence=[]\n",
    "        \n",
    "        sent = sent.lower() # Lowercase \n",
    "        sent = sent.strip() # Remove leading/trailing whitespace\n",
    "        sent = re.sub('\\s+', ' ', sent) # Remove extra space and tabs\n",
    "        sent = re.compile('<.*?>').sub('', sent) # Remove HTML tags/markups:\n",
    "        \n",
    "        for w in word_tokenize(sent):\n",
    "            # We are applying some custom filtering here, feel free to try different things\n",
    "            # Check if it is not numeric and its length>2 and not in stop words\n",
    "            if(not w.isnumeric()) and (len(w)>2) and (w not in stop_words):  \n",
    "                # Stem and add to filtered list\n",
    "                filtered_sentence.append(snow.stem(w))\n",
    "        final_string = \" \".join(filtered_sentence) #final string of cleaned words\n",
    " \n",
    "        final_text_list.append(final_string)\n",
    "        \n",
    "    return final_text_list"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. <a name=\"4\">Train - Validation Split</a>\n",
    "(<a href=\"#0\">Go to top</a>)\n",
    "\n",
    "Let's split our dataset into training (90%) and validation (10%)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "X_train, X_val, y_train, y_val = train_test_split(df[[\"reviewText\", \"summary\", \"time\", \"log_votes\"]],\n",
    "                                                  df[\"isPositive\"],\n",
    "                                                  test_size=0.10,\n",
    "                                                  shuffle=True,\n",
    "                                                  random_state=324\n",
    "                                                 )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Processing the reviewText fields\n",
      "Processing the summary fields\n"
     ]
    }
   ],
   "source": [
    "print(\"Processing the reviewText fields\")\n",
    "X_train[\"reviewText\"] = process_text(X_train[\"reviewText\"].tolist())\n",
    "X_val[\"reviewText\"] = process_text(X_val[\"reviewText\"].tolist())\n",
    "\n",
    "print(\"Processing the summary fields\")\n",
    "X_train[\"summary\"] = process_text(X_train[\"summary\"].tolist())\n",
    "X_val[\"summary\"] = process_text(X_val[\"summary\"].tolist())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Our process_text() method in section 3 uses empty string for missing values."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. <a name=\"5\">Data processing with Pipeline and ColumnTransform</a>\n",
    "(<a href=\"#0\">Go to top</a>)\n",
    "\n",
    "In the previous examples, we have seen how to use pipeline to prepare a data field for our machine learning model. This time, we will focus on multiple fields: numeric and text fields. Find more details on __Decision Trees__ here: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html\n",
    "\n",
    "   * For the numerical features pipeline, the __numerical_processor__ below, we use a MinMaxScaler (don't have to scale features when using Decision Trees, but it's a good idea to see how to use more data transforms). If different processing is desired for different numerical features, different pipelines should be built - just like shown below for the two text features.\n",
    "   * For the text features pipeline, the __text_processor__ below, we use CountVectorizer() for the text fields.\n",
    "   \n",
    "The selective preparations of the dataset features are then put together into a collective ColumnTransformer, to be finally used in a Pipeline along with an estimator. This ensures that the transforms are performed automatically on the raw data when fitting the model and when making predictions, such as when evaluating the model on a validation dataset via cross-validation or making predictions on a test dataset in the future."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Grab model features/inputs and target/output\n",
    "numerical_features = ['time',\n",
    "                      'log_votes']\n",
    "\n",
    "text_features = ['summary',\n",
    "                 'reviewText']\n",
    "\n",
    "model_features = numerical_features + text_features\n",
    "model_target = 'isPositive'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<style>#sk-container-id-1 {\n",
       "  /* Definition of color scheme common for light and dark mode */\n",
       "  --sklearn-color-text: black;\n",
       "  --sklearn-color-line: gray;\n",
       "  /* Definition of color scheme for unfitted estimators */\n",
       "  --sklearn-color-unfitted-level-0: #fff5e6;\n",
       "  --sklearn-color-unfitted-level-1: #f6e4d2;\n",
       "  --sklearn-color-unfitted-level-2: #ffe0b3;\n",
       "  --sklearn-color-unfitted-level-3: chocolate;\n",
       "  /* Definition of color scheme for fitted estimators */\n",
       "  --sklearn-color-fitted-level-0: #f0f8ff;\n",
       "  --sklearn-color-fitted-level-1: #d4ebff;\n",
       "  --sklearn-color-fitted-level-2: #b3dbfd;\n",
       "  --sklearn-color-fitted-level-3: cornflowerblue;\n",
       "\n",
       "  /* Specific color for light theme */\n",
       "  --sklearn-color-text-on-default-background: var(--sg-text-color, var(--theme-code-foreground, var(--jp-content-font-color1, black)));\n",
       "  --sklearn-color-background: var(--sg-background-color, var(--theme-background, var(--jp-layout-color0, white)));\n",
       "  --sklearn-color-border-box: var(--sg-text-color, var(--theme-code-foreground, var(--jp-content-font-color1, black)));\n",
       "  --sklearn-color-icon: #696969;\n",
       "\n",
       "  @media (prefers-color-scheme: dark) {\n",
       "    /* Redefinition of color scheme for dark theme */\n",
       "    --sklearn-color-text-on-default-background: var(--sg-text-color, var(--theme-code-foreground, var(--jp-content-font-color1, white)));\n",
       "    --sklearn-color-background: var(--sg-background-color, var(--theme-background, var(--jp-layout-color0, #111)));\n",
       "    --sklearn-color-border-box: var(--sg-text-color, var(--theme-code-foreground, var(--jp-content-font-color1, white)));\n",
       "    --sklearn-color-icon: #878787;\n",
       "  }\n",
       "}\n",
       "\n",
       "#sk-container-id-1 {\n",
       "  color: var(--sklearn-color-text);\n",
       "}\n",
       "\n",
       "#sk-container-id-1 pre {\n",
       "  padding: 0;\n",
       "}\n",
       "\n",
       "#sk-container-id-1 input.sk-hidden--visually {\n",
       "  border: 0;\n",
       "  clip: rect(1px 1px 1px 1px);\n",
       "  clip: rect(1px, 1px, 1px, 1px);\n",
       "  height: 1px;\n",
       "  margin: -1px;\n",
       "  overflow: hidden;\n",
       "  padding: 0;\n",
       "  position: absolute;\n",
       "  width: 1px;\n",
       "}\n",
       "\n",
       "#sk-container-id-1 div.sk-dashed-wrapped {\n",
       "  border: 1px dashed var(--sklearn-color-line);\n",
       "  margin: 0 0.4em 0.5em 0.4em;\n",
       "  box-sizing: border-box;\n",
       "  padding-bottom: 0.4em;\n",
       "  background-color: var(--sklearn-color-background);\n",
       "}\n",
       "\n",
       "#sk-container-id-1 div.sk-container {\n",
       "  /* jupyter's `normalize.less` sets `[hidden] { display: none; }`\n",
       "     but bootstrap.min.css set `[hidden] { display: none !important; }`\n",
       "     so we also need the `!important` here to be able to override the\n",
       "     default hidden behavior on the sphinx rendered scikit-learn.org.\n",
       "     See: https://github.com/scikit-learn/scikit-learn/issues/21755 */\n",
       "  display: inline-block !important;\n",
       "  position: relative;\n",
       "}\n",
       "\n",
       "#sk-container-id-1 div.sk-text-repr-fallback {\n",
       "  display: none;\n",
       "}\n",
       "\n",
       "div.sk-parallel-item,\n",
       "div.sk-serial,\n",
       "div.sk-item {\n",
       "  /* draw centered vertical line to link estimators */\n",
       "  background-image: linear-gradient(var(--sklearn-color-text-on-default-background), var(--sklearn-color-text-on-default-background));\n",
       "  background-size: 2px 100%;\n",
       "  background-repeat: no-repeat;\n",
       "  background-position: center center;\n",
       "}\n",
       "\n",
       "/* Parallel-specific style estimator block */\n",
       "\n",
       "#sk-container-id-1 div.sk-parallel-item::after {\n",
       "  content: \"\";\n",
       "  width: 100%;\n",
       "  border-bottom: 2px solid var(--sklearn-color-text-on-default-background);\n",
       "  flex-grow: 1;\n",
       "}\n",
       "\n",
       "#sk-container-id-1 div.sk-parallel {\n",
       "  display: flex;\n",
       "  align-items: stretch;\n",
       "  justify-content: center;\n",
       "  background-color: var(--sklearn-color-background);\n",
       "  position: relative;\n",
       "}\n",
       "\n",
       "#sk-container-id-1 div.sk-parallel-item {\n",
       "  display: flex;\n",
       "  flex-direction: column;\n",
       "}\n",
       "\n",
       "#sk-container-id-1 div.sk-parallel-item:first-child::after {\n",
       "  align-self: flex-end;\n",
       "  width: 50%;\n",
       "}\n",
       "\n",
       "#sk-container-id-1 div.sk-parallel-item:last-child::after {\n",
       "  align-self: flex-start;\n",
       "  width: 50%;\n",
       "}\n",
       "\n",
       "#sk-container-id-1 div.sk-parallel-item:only-child::after {\n",
       "  width: 0;\n",
       "}\n",
       "\n",
       "/* Serial-specific style estimator block */\n",
       "\n",
       "#sk-container-id-1 div.sk-serial {\n",
       "  display: flex;\n",
       "  flex-direction: column;\n",
       "  align-items: center;\n",
       "  background-color: var(--sklearn-color-background);\n",
       "  padding-right: 1em;\n",
       "  padding-left: 1em;\n",
       "}\n",
       "\n",
       "\n",
       "/* Toggleable style: style used for estimator/Pipeline/ColumnTransformer box that is\n",
       "clickable and can be expanded/collapsed.\n",
       "- Pipeline and ColumnTransformer use this feature and define the default style\n",
       "- Estimators will overwrite some part of the style using the `sk-estimator` class\n",
       "*/\n",
       "\n",
       "/* Pipeline and ColumnTransformer style (default) */\n",
       "\n",
       "#sk-container-id-1 div.sk-toggleable {\n",
       "  /* Default theme specific background. It is overwritten whether we have a\n",
       "  specific estimator or a Pipeline/ColumnTransformer */\n",
       "  background-color: var(--sklearn-color-background);\n",
       "}\n",
       "\n",
       "/* Toggleable label */\n",
       "#sk-container-id-1 label.sk-toggleable__label {\n",
       "  cursor: pointer;\n",
       "  display: block;\n",
       "  width: 100%;\n",
       "  margin-bottom: 0;\n",
       "  padding: 0.5em;\n",
       "  box-sizing: border-box;\n",
       "  text-align: center;\n",
       "}\n",
       "\n",
       "#sk-container-id-1 label.sk-toggleable__label-arrow:before {\n",
       "  /* Arrow on the left of the label */\n",
       "  content: \"▸\";\n",
       "  float: left;\n",
       "  margin-right: 0.25em;\n",
       "  color: var(--sklearn-color-icon);\n",
       "}\n",
       "\n",
       "#sk-container-id-1 label.sk-toggleable__label-arrow:hover:before {\n",
       "  color: var(--sklearn-color-text);\n",
       "}\n",
       "\n",
       "/* Toggleable content - dropdown */\n",
       "\n",
       "#sk-container-id-1 div.sk-toggleable__content {\n",
       "  max-height: 0;\n",
       "  max-width: 0;\n",
       "  overflow: hidden;\n",
       "  text-align: left;\n",
       "  /* unfitted */\n",
       "  background-color: var(--sklearn-color-unfitted-level-0);\n",
       "}\n",
       "\n",
       "#sk-container-id-1 div.sk-toggleable__content.fitted {\n",
       "  /* fitted */\n",
       "  background-color: var(--sklearn-color-fitted-level-0);\n",
       "}\n",
       "\n",
       "#sk-container-id-1 div.sk-toggleable__content pre {\n",
       "  margin: 0.2em;\n",
       "  border-radius: 0.25em;\n",
       "  color: var(--sklearn-color-text);\n",
       "  /* unfitted */\n",
       "  background-color: var(--sklearn-color-unfitted-level-0);\n",
       "}\n",
       "\n",
       "#sk-container-id-1 div.sk-toggleable__content.fitted pre {\n",
       "  /* unfitted */\n",
       "  background-color: var(--sklearn-color-fitted-level-0);\n",
       "}\n",
       "\n",
       "#sk-container-id-1 input.sk-toggleable__control:checked~div.sk-toggleable__content {\n",
       "  /* Expand drop-down */\n",
       "  max-height: 200px;\n",
       "  max-width: 100%;\n",
       "  overflow: auto;\n",
       "}\n",
       "\n",
       "#sk-container-id-1 input.sk-toggleable__control:checked~label.sk-toggleable__label-arrow:before {\n",
       "  content: \"▾\";\n",
       "}\n",
       "\n",
       "/* Pipeline/ColumnTransformer-specific style */\n",
       "\n",
       "#sk-container-id-1 div.sk-label input.sk-toggleable__control:checked~label.sk-toggleable__label {\n",
       "  color: var(--sklearn-color-text);\n",
       "  background-color: var(--sklearn-color-unfitted-level-2);\n",
       "}\n",
       "\n",
       "#sk-container-id-1 div.sk-label.fitted input.sk-toggleable__control:checked~label.sk-toggleable__label {\n",
       "  background-color: var(--sklearn-color-fitted-level-2);\n",
       "}\n",
       "\n",
       "/* Estimator-specific style */\n",
       "\n",
       "/* Colorize estimator box */\n",
       "#sk-container-id-1 div.sk-estimator input.sk-toggleable__control:checked~label.sk-toggleable__label {\n",
       "  /* unfitted */\n",
       "  background-color: var(--sklearn-color-unfitted-level-2);\n",
       "}\n",
       "\n",
       "#sk-container-id-1 div.sk-estimator.fitted input.sk-toggleable__control:checked~label.sk-toggleable__label {\n",
       "  /* fitted */\n",
       "  background-color: var(--sklearn-color-fitted-level-2);\n",
       "}\n",
       "\n",
       "#sk-container-id-1 div.sk-label label.sk-toggleable__label,\n",
       "#sk-container-id-1 div.sk-label label {\n",
       "  /* The background is the default theme color */\n",
       "  color: var(--sklearn-color-text-on-default-background);\n",
       "}\n",
       "\n",
       "/* On hover, darken the color of the background */\n",
       "#sk-container-id-1 div.sk-label:hover label.sk-toggleable__label {\n",
       "  color: var(--sklearn-color-text);\n",
       "  background-color: var(--sklearn-color-unfitted-level-2);\n",
       "}\n",
       "\n",
       "/* Label box, darken color on hover, fitted */\n",
       "#sk-container-id-1 div.sk-label.fitted:hover label.sk-toggleable__label.fitted {\n",
       "  color: var(--sklearn-color-text);\n",
       "  background-color: var(--sklearn-color-fitted-level-2);\n",
       "}\n",
       "\n",
       "/* Estimator label */\n",
       "\n",
       "#sk-container-id-1 div.sk-label label {\n",
       "  font-family: monospace;\n",
       "  font-weight: bold;\n",
       "  display: inline-block;\n",
       "  line-height: 1.2em;\n",
       "}\n",
       "\n",
       "#sk-container-id-1 div.sk-label-container {\n",
       "  text-align: center;\n",
       "}\n",
       "\n",
       "/* Estimator-specific */\n",
       "#sk-container-id-1 div.sk-estimator {\n",
       "  font-family: monospace;\n",
       "  border: 1px dotted var(--sklearn-color-border-box);\n",
       "  border-radius: 0.25em;\n",
       "  box-sizing: border-box;\n",
       "  margin-bottom: 0.5em;\n",
       "  /* unfitted */\n",
       "  background-color: var(--sklearn-color-unfitted-level-0);\n",
       "}\n",
       "\n",
       "#sk-container-id-1 div.sk-estimator.fitted {\n",
       "  /* fitted */\n",
       "  background-color: var(--sklearn-color-fitted-level-0);\n",
       "}\n",
       "\n",
       "/* on hover */\n",
       "#sk-container-id-1 div.sk-estimator:hover {\n",
       "  /* unfitted */\n",
       "  background-color: var(--sklearn-color-unfitted-level-2);\n",
       "}\n",
       "\n",
       "#sk-container-id-1 div.sk-estimator.fitted:hover {\n",
       "  /* fitted */\n",
       "  background-color: var(--sklearn-color-fitted-level-2);\n",
       "}\n",
       "\n",
       "/* Specification for estimator info (e.g. \"i\" and \"?\") */\n",
       "\n",
       "/* Common style for \"i\" and \"?\" */\n",
       "\n",
       ".sk-estimator-doc-link,\n",
       "a:link.sk-estimator-doc-link,\n",
       "a:visited.sk-estimator-doc-link {\n",
       "  float: right;\n",
       "  font-size: smaller;\n",
       "  line-height: 1em;\n",
       "  font-family: monospace;\n",
       "  background-color: var(--sklearn-color-background);\n",
       "  border-radius: 1em;\n",
       "  height: 1em;\n",
       "  width: 1em;\n",
       "  text-decoration: none !important;\n",
       "  margin-left: 1ex;\n",
       "  /* unfitted */\n",
       "  border: var(--sklearn-color-unfitted-level-1) 1pt solid;\n",
       "  color: var(--sklearn-color-unfitted-level-1);\n",
       "}\n",
       "\n",
       ".sk-estimator-doc-link.fitted,\n",
       "a:link.sk-estimator-doc-link.fitted,\n",
       "a:visited.sk-estimator-doc-link.fitted {\n",
       "  /* fitted */\n",
       "  border: var(--sklearn-color-fitted-level-1) 1pt solid;\n",
       "  color: var(--sklearn-color-fitted-level-1);\n",
       "}\n",
       "\n",
       "/* On hover */\n",
       "div.sk-estimator:hover .sk-estimator-doc-link:hover,\n",
       ".sk-estimator-doc-link:hover,\n",
       "div.sk-label-container:hover .sk-estimator-doc-link:hover,\n",
       ".sk-estimator-doc-link:hover {\n",
       "  /* unfitted */\n",
       "  background-color: var(--sklearn-color-unfitted-level-3);\n",
       "  color: var(--sklearn-color-background);\n",
       "  text-decoration: none;\n",
       "}\n",
       "\n",
       "div.sk-estimator.fitted:hover .sk-estimator-doc-link.fitted:hover,\n",
       ".sk-estimator-doc-link.fitted:hover,\n",
       "div.sk-label-container:hover .sk-estimator-doc-link.fitted:hover,\n",
       ".sk-estimator-doc-link.fitted:hover {\n",
       "  /* fitted */\n",
       "  background-color: var(--sklearn-color-fitted-level-3);\n",
       "  color: var(--sklearn-color-background);\n",
       "  text-decoration: none;\n",
       "}\n",
       "\n",
       "/* Span, style for the box shown on hovering the info icon */\n",
       ".sk-estimator-doc-link span {\n",
       "  display: none;\n",
       "  z-index: 9999;\n",
       "  position: relative;\n",
       "  font-weight: normal;\n",
       "  right: .2ex;\n",
       "  padding: .5ex;\n",
       "  margin: .5ex;\n",
       "  width: min-content;\n",
       "  min-width: 20ex;\n",
       "  max-width: 50ex;\n",
       "  color: var(--sklearn-color-text);\n",
       "  box-shadow: 2pt 2pt 4pt #999;\n",
       "  /* unfitted */\n",
       "  background: var(--sklearn-color-unfitted-level-0);\n",
       "  border: .5pt solid var(--sklearn-color-unfitted-level-3);\n",
       "}\n",
       "\n",
       ".sk-estimator-doc-link.fitted span {\n",
       "  /* fitted */\n",
       "  background: var(--sklearn-color-fitted-level-0);\n",
       "  border: var(--sklearn-color-fitted-level-3);\n",
       "}\n",
       "\n",
       ".sk-estimator-doc-link:hover span {\n",
       "  display: block;\n",
       "}\n",
       "\n",
       "/* \"?\"-specific style due to the `<a>` HTML tag */\n",
       "\n",
       "#sk-container-id-1 a.estimator_doc_link {\n",
       "  float: right;\n",
       "  font-size: 1rem;\n",
       "  line-height: 1em;\n",
       "  font-family: monospace;\n",
       "  background-color: var(--sklearn-color-background);\n",
       "  border-radius: 1rem;\n",
       "  height: 1rem;\n",
       "  width: 1rem;\n",
       "  text-decoration: none;\n",
       "  /* unfitted */\n",
       "  color: var(--sklearn-color-unfitted-level-1);\n",
       "  border: var(--sklearn-color-unfitted-level-1) 1pt solid;\n",
       "}\n",
       "\n",
       "#sk-container-id-1 a.estimator_doc_link.fitted {\n",
       "  /* fitted */\n",
       "  border: var(--sklearn-color-fitted-level-1) 1pt solid;\n",
       "  color: var(--sklearn-color-fitted-level-1);\n",
       "}\n",
       "\n",
       "/* On hover */\n",
       "#sk-container-id-1 a.estimator_doc_link:hover {\n",
       "  /* unfitted */\n",
       "  background-color: var(--sklearn-color-unfitted-level-3);\n",
       "  color: var(--sklearn-color-background);\n",
       "  text-decoration: none;\n",
       "}\n",
       "\n",
       "#sk-container-id-1 a.estimator_doc_link.fitted:hover {\n",
       "  /* fitted */\n",
       "  background-color: var(--sklearn-color-fitted-level-3);\n",
       "}\n",
       "</style><div id=\"sk-container-id-1\" class=\"sk-top-container\"><div class=\"sk-text-repr-fallback\"><pre>Pipeline(steps=[(&#x27;data_preprocessing&#x27;,\n",
       "                 ColumnTransformer(transformers=[(&#x27;numerical_pre&#x27;,\n",
       "                                                  Pipeline(steps=[(&#x27;num_scaler&#x27;,\n",
       "                                                                   MinMaxScaler())]),\n",
       "                                                  [&#x27;time&#x27;, &#x27;log_votes&#x27;]),\n",
       "                                                 (&#x27;text_pre_0&#x27;,\n",
       "                                                  Pipeline(steps=[(&#x27;text_vect_0&#x27;,\n",
       "                                                                   CountVectorizer(binary=True,\n",
       "                                                                                   max_features=50))]),\n",
       "                                                  &#x27;summary&#x27;),\n",
       "                                                 (&#x27;text_pre_1&#x27;,\n",
       "                                                  Pipeline(steps=[(&#x27;text_vect_1&#x27;,\n",
       "                                                                   CountVectorizer(binary=True,\n",
       "                                                                                   max_features=150))]),\n",
       "                                                  &#x27;reviewText&#x27;)])),\n",
       "                (&#x27;decision_tree&#x27;,\n",
       "                 DecisionTreeClassifier(max_depth=10, min_samples_leaf=15))])</pre><b>In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. <br />On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.</b></div><div class=\"sk-container\" hidden><div class=\"sk-item sk-dashed-wrapped\"><div class=\"sk-label-container\"><div class=\"sk-label  sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-1\" type=\"checkbox\" ><label for=\"sk-estimator-id-1\" class=\"sk-toggleable__label  sk-toggleable__label-arrow \">&nbsp;&nbsp;Pipeline<a class=\"sk-estimator-doc-link \" rel=\"noreferrer\" target=\"_blank\" href=\"https://scikit-learn.org/1.4/modules/generated/sklearn.pipeline.Pipeline.html\">?<span>Documentation for Pipeline</span></a><span class=\"sk-estimator-doc-link \">i<span>Not fitted</span></span></label><div class=\"sk-toggleable__content \"><pre>Pipeline(steps=[(&#x27;data_preprocessing&#x27;,\n",
       "                 ColumnTransformer(transformers=[(&#x27;numerical_pre&#x27;,\n",
       "                                                  Pipeline(steps=[(&#x27;num_scaler&#x27;,\n",
       "                                                                   MinMaxScaler())]),\n",
       "                                                  [&#x27;time&#x27;, &#x27;log_votes&#x27;]),\n",
       "                                                 (&#x27;text_pre_0&#x27;,\n",
       "                                                  Pipeline(steps=[(&#x27;text_vect_0&#x27;,\n",
       "                                                                   CountVectorizer(binary=True,\n",
       "                                                                                   max_features=50))]),\n",
       "                                                  &#x27;summary&#x27;),\n",
       "                                                 (&#x27;text_pre_1&#x27;,\n",
       "                                                  Pipeline(steps=[(&#x27;text_vect_1&#x27;,\n",
       "                                                                   CountVectorizer(binary=True,\n",
       "                                                                                   max_features=150))]),\n",
       "                                                  &#x27;reviewText&#x27;)])),\n",
       "                (&#x27;decision_tree&#x27;,\n",
       "                 DecisionTreeClassifier(max_depth=10, min_samples_leaf=15))])</pre></div> </div></div><div class=\"sk-serial\"><div class=\"sk-item sk-dashed-wrapped\"><div class=\"sk-label-container\"><div class=\"sk-label  sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-2\" type=\"checkbox\" ><label for=\"sk-estimator-id-2\" class=\"sk-toggleable__label  sk-toggleable__label-arrow \">&nbsp;data_preprocessing: ColumnTransformer<a class=\"sk-estimator-doc-link \" rel=\"noreferrer\" target=\"_blank\" href=\"https://scikit-learn.org/1.4/modules/generated/sklearn.compose.ColumnTransformer.html\">?<span>Documentation for data_preprocessing: ColumnTransformer</span></a></label><div class=\"sk-toggleable__content \"><pre>ColumnTransformer(transformers=[(&#x27;numerical_pre&#x27;,\n",
       "                                 Pipeline(steps=[(&#x27;num_scaler&#x27;,\n",
       "                                                  MinMaxScaler())]),\n",
       "                                 [&#x27;time&#x27;, &#x27;log_votes&#x27;]),\n",
       "                                (&#x27;text_pre_0&#x27;,\n",
       "                                 Pipeline(steps=[(&#x27;text_vect_0&#x27;,\n",
       "                                                  CountVectorizer(binary=True,\n",
       "                                                                  max_features=50))]),\n",
       "                                 &#x27;summary&#x27;),\n",
       "                                (&#x27;text_pre_1&#x27;,\n",
       "                                 Pipeline(steps=[(&#x27;text_vect_1&#x27;,\n",
       "                                                  CountVectorizer(binary=True,\n",
       "                                                                  max_features=150))]),\n",
       "                                 &#x27;reviewText&#x27;)])</pre></div> </div></div><div class=\"sk-parallel\"><div class=\"sk-parallel-item\"><div class=\"sk-item\"><div class=\"sk-label-container\"><div class=\"sk-label  sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-3\" type=\"checkbox\" ><label for=\"sk-estimator-id-3\" class=\"sk-toggleable__label  sk-toggleable__label-arrow \">numerical_pre</label><div class=\"sk-toggleable__content \"><pre>[&#x27;time&#x27;, &#x27;log_votes&#x27;]</pre></div> </div></div><div class=\"sk-serial\"><div class=\"sk-item\"><div class=\"sk-serial\"><div class=\"sk-item\"><div class=\"sk-estimator  sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-4\" type=\"checkbox\" ><label for=\"sk-estimator-id-4\" class=\"sk-toggleable__label  sk-toggleable__label-arrow \">&nbsp;MinMaxScaler<a class=\"sk-estimator-doc-link \" rel=\"noreferrer\" target=\"_blank\" href=\"https://scikit-learn.org/1.4/modules/generated/sklearn.preprocessing.MinMaxScaler.html\">?<span>Documentation for MinMaxScaler</span></a></label><div class=\"sk-toggleable__content \"><pre>MinMaxScaler()</pre></div> </div></div></div></div></div></div></div><div class=\"sk-parallel-item\"><div class=\"sk-item\"><div class=\"sk-label-container\"><div class=\"sk-label  sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-5\" type=\"checkbox\" ><label for=\"sk-estimator-id-5\" class=\"sk-toggleable__label  sk-toggleable__label-arrow \">text_pre_0</label><div class=\"sk-toggleable__content \"><pre>summary</pre></div> </div></div><div class=\"sk-serial\"><div class=\"sk-item\"><div class=\"sk-serial\"><div class=\"sk-item\"><div class=\"sk-estimator  sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-6\" type=\"checkbox\" ><label for=\"sk-estimator-id-6\" class=\"sk-toggleable__label  sk-toggleable__label-arrow \">&nbsp;CountVectorizer<a class=\"sk-estimator-doc-link \" rel=\"noreferrer\" target=\"_blank\" href=\"https://scikit-learn.org/1.4/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html\">?<span>Documentation for CountVectorizer</span></a></label><div class=\"sk-toggleable__content \"><pre>CountVectorizer(binary=True, max_features=50)</pre></div> </div></div></div></div></div></div></div><div class=\"sk-parallel-item\"><div class=\"sk-item\"><div class=\"sk-label-container\"><div class=\"sk-label  sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-7\" type=\"checkbox\" ><label for=\"sk-estimator-id-7\" class=\"sk-toggleable__label  sk-toggleable__label-arrow \">text_pre_1</label><div class=\"sk-toggleable__content \"><pre>reviewText</pre></div> </div></div><div class=\"sk-serial\"><div class=\"sk-item\"><div class=\"sk-serial\"><div class=\"sk-item\"><div class=\"sk-estimator  sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-8\" type=\"checkbox\" ><label for=\"sk-estimator-id-8\" class=\"sk-toggleable__label  sk-toggleable__label-arrow \">&nbsp;CountVectorizer<a class=\"sk-estimator-doc-link \" rel=\"noreferrer\" target=\"_blank\" href=\"https://scikit-learn.org/1.4/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html\">?<span>Documentation for CountVectorizer</span></a></label><div class=\"sk-toggleable__content \"><pre>CountVectorizer(binary=True, max_features=150)</pre></div> </div></div></div></div></div></div></div></div></div><div class=\"sk-item\"><div class=\"sk-estimator  sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-9\" type=\"checkbox\" ><label for=\"sk-estimator-id-9\" class=\"sk-toggleable__label  sk-toggleable__label-arrow \">&nbsp;DecisionTreeClassifier<a class=\"sk-estimator-doc-link \" rel=\"noreferrer\" target=\"_blank\" href=\"https://scikit-learn.org/1.4/modules/generated/sklearn.tree.DecisionTreeClassifier.html\">?<span>Documentation for DecisionTreeClassifier</span></a></label><div class=\"sk-toggleable__content \"><pre>DecisionTreeClassifier(max_depth=10, min_samples_leaf=15)</pre></div> </div></div></div></div></div></div>"
      ],
      "text/plain": [
       "Pipeline(steps=[('data_preprocessing',\n",
       "                 ColumnTransformer(transformers=[('numerical_pre',\n",
       "                                                  Pipeline(steps=[('num_scaler',\n",
       "                                                                   MinMaxScaler())]),\n",
       "                                                  ['time', 'log_votes']),\n",
       "                                                 ('text_pre_0',\n",
       "                                                  Pipeline(steps=[('text_vect_0',\n",
       "                                                                   CountVectorizer(binary=True,\n",
       "                                                                                   max_features=50))]),\n",
       "                                                  'summary'),\n",
       "                                                 ('text_pre_1',\n",
       "                                                  Pipeline(steps=[('text_vect_1',\n",
       "                                                                   CountVectorizer(binary=True,\n",
       "                                                                                   max_features=150))]),\n",
       "                                                  'reviewText')])),\n",
       "                ('decision_tree',\n",
       "                 DecisionTreeClassifier(max_depth=10, min_samples_leaf=15))])"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.impute import SimpleImputer\n",
    "from sklearn.preprocessing import MinMaxScaler\n",
    "from sklearn.feature_extraction.text import CountVectorizer\n",
    "from sklearn.pipeline import Pipeline\n",
    "from sklearn.compose import ColumnTransformer\n",
    "from sklearn.tree import DecisionTreeClassifier\n",
    "\n",
    "### COLUMN_TRANSFORMER ###\n",
    "##########################\n",
    "\n",
    "# Preprocess the numerical features\n",
    "numerical_processor = Pipeline([\n",
    "    ('num_scaler', MinMaxScaler()) # this can be skipped for trees\n",
    "])\n",
    "\n",
    "# Preprocess 1st text feature\n",
    "text_processor_0 = Pipeline([\n",
    "    ('text_vect_0', CountVectorizer(binary=True, max_features=50))\n",
    "])\n",
    "\n",
    "# Preprocess 2nd text feature (larger vocabulary)\n",
    "text_precessor_1 = Pipeline([\n",
    "    ('text_vect_1', CountVectorizer(binary=True, max_features=150))\n",
    "])\n",
    "\n",
    "# Combine all data preprocessors from above (add more, if you choose to define more!)\n",
    "# For each processor/step specify: a name, the actual process, and finally the features to be processed\n",
    "data_preprocessor = ColumnTransformer([\n",
    "    ('numerical_pre', numerical_processor, numerical_features),\n",
    "    ('text_pre_0', text_processor_0, text_features[0]),\n",
    "    ('text_pre_1', text_precessor_1, text_features[1])\n",
    "]) \n",
    "\n",
    "### PIPELINE ###\n",
    "################\n",
    "\n",
    "# Pipeline desired all data transformers, along with an estimator at the end\n",
    "# Later you can set/reach the parameters using the names issued - for hyperparameter tuning, for example\n",
    "pipeline = Pipeline([\n",
    "    ('data_preprocessing', data_preprocessor),\n",
    "    ('decision_tree', DecisionTreeClassifier(max_depth = 10,\n",
    "                                             min_samples_leaf = 15))\n",
    "])\n",
    "\n",
    "# Visualize the pipeline\n",
    "# This will come in handy especially when building more complex pipelines, stringing together multiple preprocessing steps\n",
    "from sklearn import set_config\n",
    "set_config(display='diagram')\n",
    "pipeline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6. <a name=\"6\">Fit the Decision Tree classifier</a>\n",
    "(<a href=\"#0\">Go to top</a>)\n",
    "\n",
    "We train our model by using __.fit()__ on our training dataset. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<style>#sk-container-id-2 {\n",
       "  /* Definition of color scheme common for light and dark mode */\n",
       "  --sklearn-color-text: black;\n",
       "  --sklearn-color-line: gray;\n",
       "  /* Definition of color scheme for unfitted estimators */\n",
       "  --sklearn-color-unfitted-level-0: #fff5e6;\n",
       "  --sklearn-color-unfitted-level-1: #f6e4d2;\n",
       "  --sklearn-color-unfitted-level-2: #ffe0b3;\n",
       "  --sklearn-color-unfitted-level-3: chocolate;\n",
       "  /* Definition of color scheme for fitted estimators */\n",
       "  --sklearn-color-fitted-level-0: #f0f8ff;\n",
       "  --sklearn-color-fitted-level-1: #d4ebff;\n",
       "  --sklearn-color-fitted-level-2: #b3dbfd;\n",
       "  --sklearn-color-fitted-level-3: cornflowerblue;\n",
       "\n",
       "  /* Specific color for light theme */\n",
       "  --sklearn-color-text-on-default-background: var(--sg-text-color, var(--theme-code-foreground, var(--jp-content-font-color1, black)));\n",
       "  --sklearn-color-background: var(--sg-background-color, var(--theme-background, var(--jp-layout-color0, white)));\n",
       "  --sklearn-color-border-box: var(--sg-text-color, var(--theme-code-foreground, var(--jp-content-font-color1, black)));\n",
       "  --sklearn-color-icon: #696969;\n",
       "\n",
       "  @media (prefers-color-scheme: dark) {\n",
       "    /* Redefinition of color scheme for dark theme */\n",
       "    --sklearn-color-text-on-default-background: var(--sg-text-color, var(--theme-code-foreground, var(--jp-content-font-color1, white)));\n",
       "    --sklearn-color-background: var(--sg-background-color, var(--theme-background, var(--jp-layout-color0, #111)));\n",
       "    --sklearn-color-border-box: var(--sg-text-color, var(--theme-code-foreground, var(--jp-content-font-color1, white)));\n",
       "    --sklearn-color-icon: #878787;\n",
       "  }\n",
       "}\n",
       "\n",
       "#sk-container-id-2 {\n",
       "  color: var(--sklearn-color-text);\n",
       "}\n",
       "\n",
       "#sk-container-id-2 pre {\n",
       "  padding: 0;\n",
       "}\n",
       "\n",
       "#sk-container-id-2 input.sk-hidden--visually {\n",
       "  border: 0;\n",
       "  clip: rect(1px 1px 1px 1px);\n",
       "  clip: rect(1px, 1px, 1px, 1px);\n",
       "  height: 1px;\n",
       "  margin: -1px;\n",
       "  overflow: hidden;\n",
       "  padding: 0;\n",
       "  position: absolute;\n",
       "  width: 1px;\n",
       "}\n",
       "\n",
       "#sk-container-id-2 div.sk-dashed-wrapped {\n",
       "  border: 1px dashed var(--sklearn-color-line);\n",
       "  margin: 0 0.4em 0.5em 0.4em;\n",
       "  box-sizing: border-box;\n",
       "  padding-bottom: 0.4em;\n",
       "  background-color: var(--sklearn-color-background);\n",
       "}\n",
       "\n",
       "#sk-container-id-2 div.sk-container {\n",
       "  /* jupyter's `normalize.less` sets `[hidden] { display: none; }`\n",
       "     but bootstrap.min.css set `[hidden] { display: none !important; }`\n",
       "     so we also need the `!important` here to be able to override the\n",
       "     default hidden behavior on the sphinx rendered scikit-learn.org.\n",
       "     See: https://github.com/scikit-learn/scikit-learn/issues/21755 */\n",
       "  display: inline-block !important;\n",
       "  position: relative;\n",
       "}\n",
       "\n",
       "#sk-container-id-2 div.sk-text-repr-fallback {\n",
       "  display: none;\n",
       "}\n",
       "\n",
       "div.sk-parallel-item,\n",
       "div.sk-serial,\n",
       "div.sk-item {\n",
       "  /* draw centered vertical line to link estimators */\n",
       "  background-image: linear-gradient(var(--sklearn-color-text-on-default-background), var(--sklearn-color-text-on-default-background));\n",
       "  background-size: 2px 100%;\n",
       "  background-repeat: no-repeat;\n",
       "  background-position: center center;\n",
       "}\n",
       "\n",
       "/* Parallel-specific style estimator block */\n",
       "\n",
       "#sk-container-id-2 div.sk-parallel-item::after {\n",
       "  content: \"\";\n",
       "  width: 100%;\n",
       "  border-bottom: 2px solid var(--sklearn-color-text-on-default-background);\n",
       "  flex-grow: 1;\n",
       "}\n",
       "\n",
       "#sk-container-id-2 div.sk-parallel {\n",
       "  display: flex;\n",
       "  align-items: stretch;\n",
       "  justify-content: center;\n",
       "  background-color: var(--sklearn-color-background);\n",
       "  position: relative;\n",
       "}\n",
       "\n",
       "#sk-container-id-2 div.sk-parallel-item {\n",
       "  display: flex;\n",
       "  flex-direction: column;\n",
       "}\n",
       "\n",
       "#sk-container-id-2 div.sk-parallel-item:first-child::after {\n",
       "  align-self: flex-end;\n",
       "  width: 50%;\n",
       "}\n",
       "\n",
       "#sk-container-id-2 div.sk-parallel-item:last-child::after {\n",
       "  align-self: flex-start;\n",
       "  width: 50%;\n",
       "}\n",
       "\n",
       "#sk-container-id-2 div.sk-parallel-item:only-child::after {\n",
       "  width: 0;\n",
       "}\n",
       "\n",
       "/* Serial-specific style estimator block */\n",
       "\n",
       "#sk-container-id-2 div.sk-serial {\n",
       "  display: flex;\n",
       "  flex-direction: column;\n",
       "  align-items: center;\n",
       "  background-color: var(--sklearn-color-background);\n",
       "  padding-right: 1em;\n",
       "  padding-left: 1em;\n",
       "}\n",
       "\n",
       "\n",
       "/* Toggleable style: style used for estimator/Pipeline/ColumnTransformer box that is\n",
       "clickable and can be expanded/collapsed.\n",
       "- Pipeline and ColumnTransformer use this feature and define the default style\n",
       "- Estimators will overwrite some part of the style using the `sk-estimator` class\n",
       "*/\n",
       "\n",
       "/* Pipeline and ColumnTransformer style (default) */\n",
       "\n",
       "#sk-container-id-2 div.sk-toggleable {\n",
       "  /* Default theme specific background. It is overwritten whether we have a\n",
       "  specific estimator or a Pipeline/ColumnTransformer */\n",
       "  background-color: var(--sklearn-color-background);\n",
       "}\n",
       "\n",
       "/* Toggleable label */\n",
       "#sk-container-id-2 label.sk-toggleable__label {\n",
       "  cursor: pointer;\n",
       "  display: block;\n",
       "  width: 100%;\n",
       "  margin-bottom: 0;\n",
       "  padding: 0.5em;\n",
       "  box-sizing: border-box;\n",
       "  text-align: center;\n",
       "}\n",
       "\n",
       "#sk-container-id-2 label.sk-toggleable__label-arrow:before {\n",
       "  /* Arrow on the left of the label */\n",
       "  content: \"▸\";\n",
       "  float: left;\n",
       "  margin-right: 0.25em;\n",
       "  color: var(--sklearn-color-icon);\n",
       "}\n",
       "\n",
       "#sk-container-id-2 label.sk-toggleable__label-arrow:hover:before {\n",
       "  color: var(--sklearn-color-text);\n",
       "}\n",
       "\n",
       "/* Toggleable content - dropdown */\n",
       "\n",
       "#sk-container-id-2 div.sk-toggleable__content {\n",
       "  max-height: 0;\n",
       "  max-width: 0;\n",
       "  overflow: hidden;\n",
       "  text-align: left;\n",
       "  /* unfitted */\n",
       "  background-color: var(--sklearn-color-unfitted-level-0);\n",
       "}\n",
       "\n",
       "#sk-container-id-2 div.sk-toggleable__content.fitted {\n",
       "  /* fitted */\n",
       "  background-color: var(--sklearn-color-fitted-level-0);\n",
       "}\n",
       "\n",
       "#sk-container-id-2 div.sk-toggleable__content pre {\n",
       "  margin: 0.2em;\n",
       "  border-radius: 0.25em;\n",
       "  color: var(--sklearn-color-text);\n",
       "  /* unfitted */\n",
       "  background-color: var(--sklearn-color-unfitted-level-0);\n",
       "}\n",
       "\n",
       "#sk-container-id-2 div.sk-toggleable__content.fitted pre {\n",
       "  /* unfitted */\n",
       "  background-color: var(--sklearn-color-fitted-level-0);\n",
       "}\n",
       "\n",
       "#sk-container-id-2 input.sk-toggleable__control:checked~div.sk-toggleable__content {\n",
       "  /* Expand drop-down */\n",
       "  max-height: 200px;\n",
       "  max-width: 100%;\n",
       "  overflow: auto;\n",
       "}\n",
       "\n",
       "#sk-container-id-2 input.sk-toggleable__control:checked~label.sk-toggleable__label-arrow:before {\n",
       "  content: \"▾\";\n",
       "}\n",
       "\n",
       "/* Pipeline/ColumnTransformer-specific style */\n",
       "\n",
       "#sk-container-id-2 div.sk-label input.sk-toggleable__control:checked~label.sk-toggleable__label {\n",
       "  color: var(--sklearn-color-text);\n",
       "  background-color: var(--sklearn-color-unfitted-level-2);\n",
       "}\n",
       "\n",
       "#sk-container-id-2 div.sk-label.fitted input.sk-toggleable__control:checked~label.sk-toggleable__label {\n",
       "  background-color: var(--sklearn-color-fitted-level-2);\n",
       "}\n",
       "\n",
       "/* Estimator-specific style */\n",
       "\n",
       "/* Colorize estimator box */\n",
       "#sk-container-id-2 div.sk-estimator input.sk-toggleable__control:checked~label.sk-toggleable__label {\n",
       "  /* unfitted */\n",
       "  background-color: var(--sklearn-color-unfitted-level-2);\n",
       "}\n",
       "\n",
       "#sk-container-id-2 div.sk-estimator.fitted input.sk-toggleable__control:checked~label.sk-toggleable__label {\n",
       "  /* fitted */\n",
       "  background-color: var(--sklearn-color-fitted-level-2);\n",
       "}\n",
       "\n",
       "#sk-container-id-2 div.sk-label label.sk-toggleable__label,\n",
       "#sk-container-id-2 div.sk-label label {\n",
       "  /* The background is the default theme color */\n",
       "  color: var(--sklearn-color-text-on-default-background);\n",
       "}\n",
       "\n",
       "/* On hover, darken the color of the background */\n",
       "#sk-container-id-2 div.sk-label:hover label.sk-toggleable__label {\n",
       "  color: var(--sklearn-color-text);\n",
       "  background-color: var(--sklearn-color-unfitted-level-2);\n",
       "}\n",
       "\n",
       "/* Label box, darken color on hover, fitted */\n",
       "#sk-container-id-2 div.sk-label.fitted:hover label.sk-toggleable__label.fitted {\n",
       "  color: var(--sklearn-color-text);\n",
       "  background-color: var(--sklearn-color-fitted-level-2);\n",
       "}\n",
       "\n",
       "/* Estimator label */\n",
       "\n",
       "#sk-container-id-2 div.sk-label label {\n",
       "  font-family: monospace;\n",
       "  font-weight: bold;\n",
       "  display: inline-block;\n",
       "  line-height: 1.2em;\n",
       "}\n",
       "\n",
       "#sk-container-id-2 div.sk-label-container {\n",
       "  text-align: center;\n",
       "}\n",
       "\n",
       "/* Estimator-specific */\n",
       "#sk-container-id-2 div.sk-estimator {\n",
       "  font-family: monospace;\n",
       "  border: 1px dotted var(--sklearn-color-border-box);\n",
       "  border-radius: 0.25em;\n",
       "  box-sizing: border-box;\n",
       "  margin-bottom: 0.5em;\n",
       "  /* unfitted */\n",
       "  background-color: var(--sklearn-color-unfitted-level-0);\n",
       "}\n",
       "\n",
       "#sk-container-id-2 div.sk-estimator.fitted {\n",
       "  /* fitted */\n",
       "  background-color: var(--sklearn-color-fitted-level-0);\n",
       "}\n",
       "\n",
       "/* on hover */\n",
       "#sk-container-id-2 div.sk-estimator:hover {\n",
       "  /* unfitted */\n",
       "  background-color: var(--sklearn-color-unfitted-level-2);\n",
       "}\n",
       "\n",
       "#sk-container-id-2 div.sk-estimator.fitted:hover {\n",
       "  /* fitted */\n",
       "  background-color: var(--sklearn-color-fitted-level-2);\n",
       "}\n",
       "\n",
       "/* Specification for estimator info (e.g. \"i\" and \"?\") */\n",
       "\n",
       "/* Common style for \"i\" and \"?\" */\n",
       "\n",
       ".sk-estimator-doc-link,\n",
       "a:link.sk-estimator-doc-link,\n",
       "a:visited.sk-estimator-doc-link {\n",
       "  float: right;\n",
       "  font-size: smaller;\n",
       "  line-height: 1em;\n",
       "  font-family: monospace;\n",
       "  background-color: var(--sklearn-color-background);\n",
       "  border-radius: 1em;\n",
       "  height: 1em;\n",
       "  width: 1em;\n",
       "  text-decoration: none !important;\n",
       "  margin-left: 1ex;\n",
       "  /* unfitted */\n",
       "  border: var(--sklearn-color-unfitted-level-1) 1pt solid;\n",
       "  color: var(--sklearn-color-unfitted-level-1);\n",
       "}\n",
       "\n",
       ".sk-estimator-doc-link.fitted,\n",
       "a:link.sk-estimator-doc-link.fitted,\n",
       "a:visited.sk-estimator-doc-link.fitted {\n",
       "  /* fitted */\n",
       "  border: var(--sklearn-color-fitted-level-1) 1pt solid;\n",
       "  color: var(--sklearn-color-fitted-level-1);\n",
       "}\n",
       "\n",
       "/* On hover */\n",
       "div.sk-estimator:hover .sk-estimator-doc-link:hover,\n",
       ".sk-estimator-doc-link:hover,\n",
       "div.sk-label-container:hover .sk-estimator-doc-link:hover,\n",
       ".sk-estimator-doc-link:hover {\n",
       "  /* unfitted */\n",
       "  background-color: var(--sklearn-color-unfitted-level-3);\n",
       "  color: var(--sklearn-color-background);\n",
       "  text-decoration: none;\n",
       "}\n",
       "\n",
       "div.sk-estimator.fitted:hover .sk-estimator-doc-link.fitted:hover,\n",
       ".sk-estimator-doc-link.fitted:hover,\n",
       "div.sk-label-container:hover .sk-estimator-doc-link.fitted:hover,\n",
       ".sk-estimator-doc-link.fitted:hover {\n",
       "  /* fitted */\n",
       "  background-color: var(--sklearn-color-fitted-level-3);\n",
       "  color: var(--sklearn-color-background);\n",
       "  text-decoration: none;\n",
       "}\n",
       "\n",
       "/* Span, style for the box shown on hovering the info icon */\n",
       ".sk-estimator-doc-link span {\n",
       "  display: none;\n",
       "  z-index: 9999;\n",
       "  position: relative;\n",
       "  font-weight: normal;\n",
       "  right: .2ex;\n",
       "  padding: .5ex;\n",
       "  margin: .5ex;\n",
       "  width: min-content;\n",
       "  min-width: 20ex;\n",
       "  max-width: 50ex;\n",
       "  color: var(--sklearn-color-text);\n",
       "  box-shadow: 2pt 2pt 4pt #999;\n",
       "  /* unfitted */\n",
       "  background: var(--sklearn-color-unfitted-level-0);\n",
       "  border: .5pt solid var(--sklearn-color-unfitted-level-3);\n",
       "}\n",
       "\n",
       ".sk-estimator-doc-link.fitted span {\n",
       "  /* fitted */\n",
       "  background: var(--sklearn-color-fitted-level-0);\n",
       "  border: var(--sklearn-color-fitted-level-3);\n",
       "}\n",
       "\n",
       ".sk-estimator-doc-link:hover span {\n",
       "  display: block;\n",
       "}\n",
       "\n",
       "/* \"?\"-specific style due to the `<a>` HTML tag */\n",
       "\n",
       "#sk-container-id-2 a.estimator_doc_link {\n",
       "  float: right;\n",
       "  font-size: 1rem;\n",
       "  line-height: 1em;\n",
       "  font-family: monospace;\n",
       "  background-color: var(--sklearn-color-background);\n",
       "  border-radius: 1rem;\n",
       "  height: 1rem;\n",
       "  width: 1rem;\n",
       "  text-decoration: none;\n",
       "  /* unfitted */\n",
       "  color: var(--sklearn-color-unfitted-level-1);\n",
       "  border: var(--sklearn-color-unfitted-level-1) 1pt solid;\n",
       "}\n",
       "\n",
       "#sk-container-id-2 a.estimator_doc_link.fitted {\n",
       "  /* fitted */\n",
       "  border: var(--sklearn-color-fitted-level-1) 1pt solid;\n",
       "  color: var(--sklearn-color-fitted-level-1);\n",
       "}\n",
       "\n",
       "/* On hover */\n",
       "#sk-container-id-2 a.estimator_doc_link:hover {\n",
       "  /* unfitted */\n",
       "  background-color: var(--sklearn-color-unfitted-level-3);\n",
       "  color: var(--sklearn-color-background);\n",
       "  text-decoration: none;\n",
       "}\n",
       "\n",
       "#sk-container-id-2 a.estimator_doc_link.fitted:hover {\n",
       "  /* fitted */\n",
       "  background-color: var(--sklearn-color-fitted-level-3);\n",
       "}\n",
       "</style><div id=\"sk-container-id-2\" class=\"sk-top-container\"><div class=\"sk-text-repr-fallback\"><pre>Pipeline(steps=[(&#x27;data_preprocessing&#x27;,\n",
       "                 ColumnTransformer(transformers=[(&#x27;numerical_pre&#x27;,\n",
       "                                                  Pipeline(steps=[(&#x27;num_scaler&#x27;,\n",
       "                                                                   MinMaxScaler())]),\n",
       "                                                  [&#x27;time&#x27;, &#x27;log_votes&#x27;]),\n",
       "                                                 (&#x27;text_pre_0&#x27;,\n",
       "                                                  Pipeline(steps=[(&#x27;text_vect_0&#x27;,\n",
       "                                                                   CountVectorizer(binary=True,\n",
       "                                                                                   max_features=50))]),\n",
       "                                                  &#x27;summary&#x27;),\n",
       "                                                 (&#x27;text_pre_1&#x27;,\n",
       "                                                  Pipeline(steps=[(&#x27;text_vect_1&#x27;,\n",
       "                                                                   CountVectorizer(binary=True,\n",
       "                                                                                   max_features=150))]),\n",
       "                                                  &#x27;reviewText&#x27;)])),\n",
       "                (&#x27;decision_tree&#x27;,\n",
       "                 DecisionTreeClassifier(max_depth=10, min_samples_leaf=15))])</pre><b>In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. <br />On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.</b></div><div class=\"sk-container\" hidden><div class=\"sk-item sk-dashed-wrapped\"><div class=\"sk-label-container\"><div class=\"sk-label fitted sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-10\" type=\"checkbox\" ><label for=\"sk-estimator-id-10\" class=\"sk-toggleable__label fitted sk-toggleable__label-arrow fitted\">&nbsp;&nbsp;Pipeline<a class=\"sk-estimator-doc-link fitted\" rel=\"noreferrer\" target=\"_blank\" href=\"https://scikit-learn.org/1.4/modules/generated/sklearn.pipeline.Pipeline.html\">?<span>Documentation for Pipeline</span></a><span class=\"sk-estimator-doc-link fitted\">i<span>Fitted</span></span></label><div class=\"sk-toggleable__content fitted\"><pre>Pipeline(steps=[(&#x27;data_preprocessing&#x27;,\n",
       "                 ColumnTransformer(transformers=[(&#x27;numerical_pre&#x27;,\n",
       "                                                  Pipeline(steps=[(&#x27;num_scaler&#x27;,\n",
       "                                                                   MinMaxScaler())]),\n",
       "                                                  [&#x27;time&#x27;, &#x27;log_votes&#x27;]),\n",
       "                                                 (&#x27;text_pre_0&#x27;,\n",
       "                                                  Pipeline(steps=[(&#x27;text_vect_0&#x27;,\n",
       "                                                                   CountVectorizer(binary=True,\n",
       "                                                                                   max_features=50))]),\n",
       "                                                  &#x27;summary&#x27;),\n",
       "                                                 (&#x27;text_pre_1&#x27;,\n",
       "                                                  Pipeline(steps=[(&#x27;text_vect_1&#x27;,\n",
       "                                                                   CountVectorizer(binary=True,\n",
       "                                                                                   max_features=150))]),\n",
       "                                                  &#x27;reviewText&#x27;)])),\n",
       "                (&#x27;decision_tree&#x27;,\n",
       "                 DecisionTreeClassifier(max_depth=10, min_samples_leaf=15))])</pre></div> </div></div><div class=\"sk-serial\"><div class=\"sk-item sk-dashed-wrapped\"><div class=\"sk-label-container\"><div class=\"sk-label fitted sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-11\" type=\"checkbox\" ><label for=\"sk-estimator-id-11\" class=\"sk-toggleable__label fitted sk-toggleable__label-arrow fitted\">&nbsp;data_preprocessing: ColumnTransformer<a class=\"sk-estimator-doc-link fitted\" rel=\"noreferrer\" target=\"_blank\" href=\"https://scikit-learn.org/1.4/modules/generated/sklearn.compose.ColumnTransformer.html\">?<span>Documentation for data_preprocessing: ColumnTransformer</span></a></label><div class=\"sk-toggleable__content fitted\"><pre>ColumnTransformer(transformers=[(&#x27;numerical_pre&#x27;,\n",
       "                                 Pipeline(steps=[(&#x27;num_scaler&#x27;,\n",
       "                                                  MinMaxScaler())]),\n",
       "                                 [&#x27;time&#x27;, &#x27;log_votes&#x27;]),\n",
       "                                (&#x27;text_pre_0&#x27;,\n",
       "                                 Pipeline(steps=[(&#x27;text_vect_0&#x27;,\n",
       "                                                  CountVectorizer(binary=True,\n",
       "                                                                  max_features=50))]),\n",
       "                                 &#x27;summary&#x27;),\n",
       "                                (&#x27;text_pre_1&#x27;,\n",
       "                                 Pipeline(steps=[(&#x27;text_vect_1&#x27;,\n",
       "                                                  CountVectorizer(binary=True,\n",
       "                                                                  max_features=150))]),\n",
       "                                 &#x27;reviewText&#x27;)])</pre></div> </div></div><div class=\"sk-parallel\"><div class=\"sk-parallel-item\"><div class=\"sk-item\"><div class=\"sk-label-container\"><div class=\"sk-label fitted sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-12\" type=\"checkbox\" ><label for=\"sk-estimator-id-12\" class=\"sk-toggleable__label fitted sk-toggleable__label-arrow fitted\">numerical_pre</label><div class=\"sk-toggleable__content fitted\"><pre>[&#x27;time&#x27;, &#x27;log_votes&#x27;]</pre></div> </div></div><div class=\"sk-serial\"><div class=\"sk-item\"><div class=\"sk-serial\"><div class=\"sk-item\"><div class=\"sk-estimator fitted sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-13\" type=\"checkbox\" ><label for=\"sk-estimator-id-13\" class=\"sk-toggleable__label fitted sk-toggleable__label-arrow fitted\">&nbsp;MinMaxScaler<a class=\"sk-estimator-doc-link fitted\" rel=\"noreferrer\" target=\"_blank\" href=\"https://scikit-learn.org/1.4/modules/generated/sklearn.preprocessing.MinMaxScaler.html\">?<span>Documentation for MinMaxScaler</span></a></label><div class=\"sk-toggleable__content fitted\"><pre>MinMaxScaler()</pre></div> </div></div></div></div></div></div></div><div class=\"sk-parallel-item\"><div class=\"sk-item\"><div class=\"sk-label-container\"><div class=\"sk-label fitted sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-14\" type=\"checkbox\" ><label for=\"sk-estimator-id-14\" class=\"sk-toggleable__label fitted sk-toggleable__label-arrow fitted\">text_pre_0</label><div class=\"sk-toggleable__content fitted\"><pre>summary</pre></div> </div></div><div class=\"sk-serial\"><div class=\"sk-item\"><div class=\"sk-serial\"><div class=\"sk-item\"><div class=\"sk-estimator fitted sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-15\" type=\"checkbox\" ><label for=\"sk-estimator-id-15\" class=\"sk-toggleable__label fitted sk-toggleable__label-arrow fitted\">&nbsp;CountVectorizer<a class=\"sk-estimator-doc-link fitted\" rel=\"noreferrer\" target=\"_blank\" href=\"https://scikit-learn.org/1.4/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html\">?<span>Documentation for CountVectorizer</span></a></label><div class=\"sk-toggleable__content fitted\"><pre>CountVectorizer(binary=True, max_features=50)</pre></div> </div></div></div></div></div></div></div><div class=\"sk-parallel-item\"><div class=\"sk-item\"><div class=\"sk-label-container\"><div class=\"sk-label fitted sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-16\" type=\"checkbox\" ><label for=\"sk-estimator-id-16\" class=\"sk-toggleable__label fitted sk-toggleable__label-arrow fitted\">text_pre_1</label><div class=\"sk-toggleable__content fitted\"><pre>reviewText</pre></div> </div></div><div class=\"sk-serial\"><div class=\"sk-item\"><div class=\"sk-serial\"><div class=\"sk-item\"><div class=\"sk-estimator fitted sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-17\" type=\"checkbox\" ><label for=\"sk-estimator-id-17\" class=\"sk-toggleable__label fitted sk-toggleable__label-arrow fitted\">&nbsp;CountVectorizer<a class=\"sk-estimator-doc-link fitted\" rel=\"noreferrer\" target=\"_blank\" href=\"https://scikit-learn.org/1.4/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html\">?<span>Documentation for CountVectorizer</span></a></label><div class=\"sk-toggleable__content fitted\"><pre>CountVectorizer(binary=True, max_features=150)</pre></div> </div></div></div></div></div></div></div></div></div><div class=\"sk-item\"><div class=\"sk-estimator fitted sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-18\" type=\"checkbox\" ><label for=\"sk-estimator-id-18\" class=\"sk-toggleable__label fitted sk-toggleable__label-arrow fitted\">&nbsp;DecisionTreeClassifier<a class=\"sk-estimator-doc-link fitted\" rel=\"noreferrer\" target=\"_blank\" href=\"https://scikit-learn.org/1.4/modules/generated/sklearn.tree.DecisionTreeClassifier.html\">?<span>Documentation for DecisionTreeClassifier</span></a></label><div class=\"sk-toggleable__content fitted\"><pre>DecisionTreeClassifier(max_depth=10, min_samples_leaf=15)</pre></div> </div></div></div></div></div></div>"
      ],
      "text/plain": [
       "Pipeline(steps=[('data_preprocessing',\n",
       "                 ColumnTransformer(transformers=[('numerical_pre',\n",
       "                                                  Pipeline(steps=[('num_scaler',\n",
       "                                                                   MinMaxScaler())]),\n",
       "                                                  ['time', 'log_votes']),\n",
       "                                                 ('text_pre_0',\n",
       "                                                  Pipeline(steps=[('text_vect_0',\n",
       "                                                                   CountVectorizer(binary=True,\n",
       "                                                                                   max_features=50))]),\n",
       "                                                  'summary'),\n",
       "                                                 ('text_pre_1',\n",
       "                                                  Pipeline(steps=[('text_vect_1',\n",
       "                                                                   CountVectorizer(binary=True,\n",
       "                                                                                   max_features=150))]),\n",
       "                                                  'reviewText')])),\n",
       "                ('decision_tree',\n",
       "                 DecisionTreeClassifier(max_depth=10, min_samples_leaf=15))])"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Fit the Pipeline to training data\n",
    "pipeline.fit(X_train, y_train.values)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 7. <a name=\"7\">Test the classifier</a>\n",
    "(<a href=\"#0\">Go to top</a>)\n",
    "\n",
    "Let's evaluate the performance of the trained classifier. We use __.predict()__ this time. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[1840  765]\n",
      " [ 716 3679]]\n",
      "              precision    recall  f1-score   support\n",
      "\n",
      "         0.0       0.72      0.71      0.71      2605\n",
      "         1.0       0.83      0.84      0.83      4395\n",
      "\n",
      "    accuracy                           0.79      7000\n",
      "   macro avg       0.77      0.77      0.77      7000\n",
      "weighted avg       0.79      0.79      0.79      7000\n",
      "\n",
      "Accuracy (validation): 0.7884285714285715\n"
     ]
    }
   ],
   "source": [
    "from sklearn.metrics import confusion_matrix, classification_report, accuracy_score\n",
    "\n",
    "# Use the fitted pipeline to make predictions on the validation dataset\n",
    "val_predictions = pipeline.predict(X_val)\n",
    "print(confusion_matrix(y_val.values, val_predictions))\n",
    "print(classification_report(y_val.values, val_predictions))\n",
    "print(\"Accuracy (validation):\", accuracy_score(y_val.values, val_predictions))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 8. <a name=\"8\">Fit and test the Random Forest classifier</a>\n",
    "(<a href=\"#0\">Go to top</a>)\n",
    "\n",
    "This time, we will use the Random Forest classifier. Let's update our pipeline for that"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<style>#sk-container-id-3 {\n",
       "  /* Definition of color scheme common for light and dark mode */\n",
       "  --sklearn-color-text: black;\n",
       "  --sklearn-color-line: gray;\n",
       "  /* Definition of color scheme for unfitted estimators */\n",
       "  --sklearn-color-unfitted-level-0: #fff5e6;\n",
       "  --sklearn-color-unfitted-level-1: #f6e4d2;\n",
       "  --sklearn-color-unfitted-level-2: #ffe0b3;\n",
       "  --sklearn-color-unfitted-level-3: chocolate;\n",
       "  /* Definition of color scheme for fitted estimators */\n",
       "  --sklearn-color-fitted-level-0: #f0f8ff;\n",
       "  --sklearn-color-fitted-level-1: #d4ebff;\n",
       "  --sklearn-color-fitted-level-2: #b3dbfd;\n",
       "  --sklearn-color-fitted-level-3: cornflowerblue;\n",
       "\n",
       "  /* Specific color for light theme */\n",
       "  --sklearn-color-text-on-default-background: var(--sg-text-color, var(--theme-code-foreground, var(--jp-content-font-color1, black)));\n",
       "  --sklearn-color-background: var(--sg-background-color, var(--theme-background, var(--jp-layout-color0, white)));\n",
       "  --sklearn-color-border-box: var(--sg-text-color, var(--theme-code-foreground, var(--jp-content-font-color1, black)));\n",
       "  --sklearn-color-icon: #696969;\n",
       "\n",
       "  @media (prefers-color-scheme: dark) {\n",
       "    /* Redefinition of color scheme for dark theme */\n",
       "    --sklearn-color-text-on-default-background: var(--sg-text-color, var(--theme-code-foreground, var(--jp-content-font-color1, white)));\n",
       "    --sklearn-color-background: var(--sg-background-color, var(--theme-background, var(--jp-layout-color0, #111)));\n",
       "    --sklearn-color-border-box: var(--sg-text-color, var(--theme-code-foreground, var(--jp-content-font-color1, white)));\n",
       "    --sklearn-color-icon: #878787;\n",
       "  }\n",
       "}\n",
       "\n",
       "#sk-container-id-3 {\n",
       "  color: var(--sklearn-color-text);\n",
       "}\n",
       "\n",
       "#sk-container-id-3 pre {\n",
       "  padding: 0;\n",
       "}\n",
       "\n",
       "#sk-container-id-3 input.sk-hidden--visually {\n",
       "  border: 0;\n",
       "  clip: rect(1px 1px 1px 1px);\n",
       "  clip: rect(1px, 1px, 1px, 1px);\n",
       "  height: 1px;\n",
       "  margin: -1px;\n",
       "  overflow: hidden;\n",
       "  padding: 0;\n",
       "  position: absolute;\n",
       "  width: 1px;\n",
       "}\n",
       "\n",
       "#sk-container-id-3 div.sk-dashed-wrapped {\n",
       "  border: 1px dashed var(--sklearn-color-line);\n",
       "  margin: 0 0.4em 0.5em 0.4em;\n",
       "  box-sizing: border-box;\n",
       "  padding-bottom: 0.4em;\n",
       "  background-color: var(--sklearn-color-background);\n",
       "}\n",
       "\n",
       "#sk-container-id-3 div.sk-container {\n",
       "  /* jupyter's `normalize.less` sets `[hidden] { display: none; }`\n",
       "     but bootstrap.min.css set `[hidden] { display: none !important; }`\n",
       "     so we also need the `!important` here to be able to override the\n",
       "     default hidden behavior on the sphinx rendered scikit-learn.org.\n",
       "     See: https://github.com/scikit-learn/scikit-learn/issues/21755 */\n",
       "  display: inline-block !important;\n",
       "  position: relative;\n",
       "}\n",
       "\n",
       "#sk-container-id-3 div.sk-text-repr-fallback {\n",
       "  display: none;\n",
       "}\n",
       "\n",
       "div.sk-parallel-item,\n",
       "div.sk-serial,\n",
       "div.sk-item {\n",
       "  /* draw centered vertical line to link estimators */\n",
       "  background-image: linear-gradient(var(--sklearn-color-text-on-default-background), var(--sklearn-color-text-on-default-background));\n",
       "  background-size: 2px 100%;\n",
       "  background-repeat: no-repeat;\n",
       "  background-position: center center;\n",
       "}\n",
       "\n",
       "/* Parallel-specific style estimator block */\n",
       "\n",
       "#sk-container-id-3 div.sk-parallel-item::after {\n",
       "  content: \"\";\n",
       "  width: 100%;\n",
       "  border-bottom: 2px solid var(--sklearn-color-text-on-default-background);\n",
       "  flex-grow: 1;\n",
       "}\n",
       "\n",
       "#sk-container-id-3 div.sk-parallel {\n",
       "  display: flex;\n",
       "  align-items: stretch;\n",
       "  justify-content: center;\n",
       "  background-color: var(--sklearn-color-background);\n",
       "  position: relative;\n",
       "}\n",
       "\n",
       "#sk-container-id-3 div.sk-parallel-item {\n",
       "  display: flex;\n",
       "  flex-direction: column;\n",
       "}\n",
       "\n",
       "#sk-container-id-3 div.sk-parallel-item:first-child::after {\n",
       "  align-self: flex-end;\n",
       "  width: 50%;\n",
       "}\n",
       "\n",
       "#sk-container-id-3 div.sk-parallel-item:last-child::after {\n",
       "  align-self: flex-start;\n",
       "  width: 50%;\n",
       "}\n",
       "\n",
       "#sk-container-id-3 div.sk-parallel-item:only-child::after {\n",
       "  width: 0;\n",
       "}\n",
       "\n",
       "/* Serial-specific style estimator block */\n",
       "\n",
       "#sk-container-id-3 div.sk-serial {\n",
       "  display: flex;\n",
       "  flex-direction: column;\n",
       "  align-items: center;\n",
       "  background-color: var(--sklearn-color-background);\n",
       "  padding-right: 1em;\n",
       "  padding-left: 1em;\n",
       "}\n",
       "\n",
       "\n",
       "/* Toggleable style: style used for estimator/Pipeline/ColumnTransformer box that is\n",
       "clickable and can be expanded/collapsed.\n",
       "- Pipeline and ColumnTransformer use this feature and define the default style\n",
       "- Estimators will overwrite some part of the style using the `sk-estimator` class\n",
       "*/\n",
       "\n",
       "/* Pipeline and ColumnTransformer style (default) */\n",
       "\n",
       "#sk-container-id-3 div.sk-toggleable {\n",
       "  /* Default theme specific background. It is overwritten whether we have a\n",
       "  specific estimator or a Pipeline/ColumnTransformer */\n",
       "  background-color: var(--sklearn-color-background);\n",
       "}\n",
       "\n",
       "/* Toggleable label */\n",
       "#sk-container-id-3 label.sk-toggleable__label {\n",
       "  cursor: pointer;\n",
       "  display: block;\n",
       "  width: 100%;\n",
       "  margin-bottom: 0;\n",
       "  padding: 0.5em;\n",
       "  box-sizing: border-box;\n",
       "  text-align: center;\n",
       "}\n",
       "\n",
       "#sk-container-id-3 label.sk-toggleable__label-arrow:before {\n",
       "  /* Arrow on the left of the label */\n",
       "  content: \"▸\";\n",
       "  float: left;\n",
       "  margin-right: 0.25em;\n",
       "  color: var(--sklearn-color-icon);\n",
       "}\n",
       "\n",
       "#sk-container-id-3 label.sk-toggleable__label-arrow:hover:before {\n",
       "  color: var(--sklearn-color-text);\n",
       "}\n",
       "\n",
       "/* Toggleable content - dropdown */\n",
       "\n",
       "#sk-container-id-3 div.sk-toggleable__content {\n",
       "  max-height: 0;\n",
       "  max-width: 0;\n",
       "  overflow: hidden;\n",
       "  text-align: left;\n",
       "  /* unfitted */\n",
       "  background-color: var(--sklearn-color-unfitted-level-0);\n",
       "}\n",
       "\n",
       "#sk-container-id-3 div.sk-toggleable__content.fitted {\n",
       "  /* fitted */\n",
       "  background-color: var(--sklearn-color-fitted-level-0);\n",
       "}\n",
       "\n",
       "#sk-container-id-3 div.sk-toggleable__content pre {\n",
       "  margin: 0.2em;\n",
       "  border-radius: 0.25em;\n",
       "  color: var(--sklearn-color-text);\n",
       "  /* unfitted */\n",
       "  background-color: var(--sklearn-color-unfitted-level-0);\n",
       "}\n",
       "\n",
       "#sk-container-id-3 div.sk-toggleable__content.fitted pre {\n",
       "  /* unfitted */\n",
       "  background-color: var(--sklearn-color-fitted-level-0);\n",
       "}\n",
       "\n",
       "#sk-container-id-3 input.sk-toggleable__control:checked~div.sk-toggleable__content {\n",
       "  /* Expand drop-down */\n",
       "  max-height: 200px;\n",
       "  max-width: 100%;\n",
       "  overflow: auto;\n",
       "}\n",
       "\n",
       "#sk-container-id-3 input.sk-toggleable__control:checked~label.sk-toggleable__label-arrow:before {\n",
       "  content: \"▾\";\n",
       "}\n",
       "\n",
       "/* Pipeline/ColumnTransformer-specific style */\n",
       "\n",
       "#sk-container-id-3 div.sk-label input.sk-toggleable__control:checked~label.sk-toggleable__label {\n",
       "  color: var(--sklearn-color-text);\n",
       "  background-color: var(--sklearn-color-unfitted-level-2);\n",
       "}\n",
       "\n",
       "#sk-container-id-3 div.sk-label.fitted input.sk-toggleable__control:checked~label.sk-toggleable__label {\n",
       "  background-color: var(--sklearn-color-fitted-level-2);\n",
       "}\n",
       "\n",
       "/* Estimator-specific style */\n",
       "\n",
       "/* Colorize estimator box */\n",
       "#sk-container-id-3 div.sk-estimator input.sk-toggleable__control:checked~label.sk-toggleable__label {\n",
       "  /* unfitted */\n",
       "  background-color: var(--sklearn-color-unfitted-level-2);\n",
       "}\n",
       "\n",
       "#sk-container-id-3 div.sk-estimator.fitted input.sk-toggleable__control:checked~label.sk-toggleable__label {\n",
       "  /* fitted */\n",
       "  background-color: var(--sklearn-color-fitted-level-2);\n",
       "}\n",
       "\n",
       "#sk-container-id-3 div.sk-label label.sk-toggleable__label,\n",
       "#sk-container-id-3 div.sk-label label {\n",
       "  /* The background is the default theme color */\n",
       "  color: var(--sklearn-color-text-on-default-background);\n",
       "}\n",
       "\n",
       "/* On hover, darken the color of the background */\n",
       "#sk-container-id-3 div.sk-label:hover label.sk-toggleable__label {\n",
       "  color: var(--sklearn-color-text);\n",
       "  background-color: var(--sklearn-color-unfitted-level-2);\n",
       "}\n",
       "\n",
       "/* Label box, darken color on hover, fitted */\n",
       "#sk-container-id-3 div.sk-label.fitted:hover label.sk-toggleable__label.fitted {\n",
       "  color: var(--sklearn-color-text);\n",
       "  background-color: var(--sklearn-color-fitted-level-2);\n",
       "}\n",
       "\n",
       "/* Estimator label */\n",
       "\n",
       "#sk-container-id-3 div.sk-label label {\n",
       "  font-family: monospace;\n",
       "  font-weight: bold;\n",
       "  display: inline-block;\n",
       "  line-height: 1.2em;\n",
       "}\n",
       "\n",
       "#sk-container-id-3 div.sk-label-container {\n",
       "  text-align: center;\n",
       "}\n",
       "\n",
       "/* Estimator-specific */\n",
       "#sk-container-id-3 div.sk-estimator {\n",
       "  font-family: monospace;\n",
       "  border: 1px dotted var(--sklearn-color-border-box);\n",
       "  border-radius: 0.25em;\n",
       "  box-sizing: border-box;\n",
       "  margin-bottom: 0.5em;\n",
       "  /* unfitted */\n",
       "  background-color: var(--sklearn-color-unfitted-level-0);\n",
       "}\n",
       "\n",
       "#sk-container-id-3 div.sk-estimator.fitted {\n",
       "  /* fitted */\n",
       "  background-color: var(--sklearn-color-fitted-level-0);\n",
       "}\n",
       "\n",
       "/* on hover */\n",
       "#sk-container-id-3 div.sk-estimator:hover {\n",
       "  /* unfitted */\n",
       "  background-color: var(--sklearn-color-unfitted-level-2);\n",
       "}\n",
       "\n",
       "#sk-container-id-3 div.sk-estimator.fitted:hover {\n",
       "  /* fitted */\n",
       "  background-color: var(--sklearn-color-fitted-level-2);\n",
       "}\n",
       "\n",
       "/* Specification for estimator info (e.g. \"i\" and \"?\") */\n",
       "\n",
       "/* Common style for \"i\" and \"?\" */\n",
       "\n",
       ".sk-estimator-doc-link,\n",
       "a:link.sk-estimator-doc-link,\n",
       "a:visited.sk-estimator-doc-link {\n",
       "  float: right;\n",
       "  font-size: smaller;\n",
       "  line-height: 1em;\n",
       "  font-family: monospace;\n",
       "  background-color: var(--sklearn-color-background);\n",
       "  border-radius: 1em;\n",
       "  height: 1em;\n",
       "  width: 1em;\n",
       "  text-decoration: none !important;\n",
       "  margin-left: 1ex;\n",
       "  /* unfitted */\n",
       "  border: var(--sklearn-color-unfitted-level-1) 1pt solid;\n",
       "  color: var(--sklearn-color-unfitted-level-1);\n",
       "}\n",
       "\n",
       ".sk-estimator-doc-link.fitted,\n",
       "a:link.sk-estimator-doc-link.fitted,\n",
       "a:visited.sk-estimator-doc-link.fitted {\n",
       "  /* fitted */\n",
       "  border: var(--sklearn-color-fitted-level-1) 1pt solid;\n",
       "  color: var(--sklearn-color-fitted-level-1);\n",
       "}\n",
       "\n",
       "/* On hover */\n",
       "div.sk-estimator:hover .sk-estimator-doc-link:hover,\n",
       ".sk-estimator-doc-link:hover,\n",
       "div.sk-label-container:hover .sk-estimator-doc-link:hover,\n",
       ".sk-estimator-doc-link:hover {\n",
       "  /* unfitted */\n",
       "  background-color: var(--sklearn-color-unfitted-level-3);\n",
       "  color: var(--sklearn-color-background);\n",
       "  text-decoration: none;\n",
       "}\n",
       "\n",
       "div.sk-estimator.fitted:hover .sk-estimator-doc-link.fitted:hover,\n",
       ".sk-estimator-doc-link.fitted:hover,\n",
       "div.sk-label-container:hover .sk-estimator-doc-link.fitted:hover,\n",
       ".sk-estimator-doc-link.fitted:hover {\n",
       "  /* fitted */\n",
       "  background-color: var(--sklearn-color-fitted-level-3);\n",
       "  color: var(--sklearn-color-background);\n",
       "  text-decoration: none;\n",
       "}\n",
       "\n",
       "/* Span, style for the box shown on hovering the info icon */\n",
       ".sk-estimator-doc-link span {\n",
       "  display: none;\n",
       "  z-index: 9999;\n",
       "  position: relative;\n",
       "  font-weight: normal;\n",
       "  right: .2ex;\n",
       "  padding: .5ex;\n",
       "  margin: .5ex;\n",
       "  width: min-content;\n",
       "  min-width: 20ex;\n",
       "  max-width: 50ex;\n",
       "  color: var(--sklearn-color-text);\n",
       "  box-shadow: 2pt 2pt 4pt #999;\n",
       "  /* unfitted */\n",
       "  background: var(--sklearn-color-unfitted-level-0);\n",
       "  border: .5pt solid var(--sklearn-color-unfitted-level-3);\n",
       "}\n",
       "\n",
       ".sk-estimator-doc-link.fitted span {\n",
       "  /* fitted */\n",
       "  background: var(--sklearn-color-fitted-level-0);\n",
       "  border: var(--sklearn-color-fitted-level-3);\n",
       "}\n",
       "\n",
       ".sk-estimator-doc-link:hover span {\n",
       "  display: block;\n",
       "}\n",
       "\n",
       "/* \"?\"-specific style due to the `<a>` HTML tag */\n",
       "\n",
       "#sk-container-id-3 a.estimator_doc_link {\n",
       "  float: right;\n",
       "  font-size: 1rem;\n",
       "  line-height: 1em;\n",
       "  font-family: monospace;\n",
       "  background-color: var(--sklearn-color-background);\n",
       "  border-radius: 1rem;\n",
       "  height: 1rem;\n",
       "  width: 1rem;\n",
       "  text-decoration: none;\n",
       "  /* unfitted */\n",
       "  color: var(--sklearn-color-unfitted-level-1);\n",
       "  border: var(--sklearn-color-unfitted-level-1) 1pt solid;\n",
       "}\n",
       "\n",
       "#sk-container-id-3 a.estimator_doc_link.fitted {\n",
       "  /* fitted */\n",
       "  border: var(--sklearn-color-fitted-level-1) 1pt solid;\n",
       "  color: var(--sklearn-color-fitted-level-1);\n",
       "}\n",
       "\n",
       "/* On hover */\n",
       "#sk-container-id-3 a.estimator_doc_link:hover {\n",
       "  /* unfitted */\n",
       "  background-color: var(--sklearn-color-unfitted-level-3);\n",
       "  color: var(--sklearn-color-background);\n",
       "  text-decoration: none;\n",
       "}\n",
       "\n",
       "#sk-container-id-3 a.estimator_doc_link.fitted:hover {\n",
       "  /* fitted */\n",
       "  background-color: var(--sklearn-color-fitted-level-3);\n",
       "}\n",
       "</style><div id=\"sk-container-id-3\" class=\"sk-top-container\"><div class=\"sk-text-repr-fallback\"><pre>Pipeline(steps=[(&#x27;data_preprocessing&#x27;,\n",
       "                 ColumnTransformer(transformers=[(&#x27;numerical_pre&#x27;,\n",
       "                                                  Pipeline(steps=[(&#x27;num_scaler&#x27;,\n",
       "                                                                   MinMaxScaler())]),\n",
       "                                                  [&#x27;time&#x27;, &#x27;log_votes&#x27;]),\n",
       "                                                 (&#x27;text_pre_0&#x27;,\n",
       "                                                  Pipeline(steps=[(&#x27;text_vect_0&#x27;,\n",
       "                                                                   CountVectorizer(binary=True,\n",
       "                                                                                   max_features=50))]),\n",
       "                                                  &#x27;summary&#x27;),\n",
       "                                                 (&#x27;text_pre_1&#x27;,\n",
       "                                                  Pipeline(steps=[(&#x27;text_vect_1&#x27;,\n",
       "                                                                   CountVectorizer(binary=True,\n",
       "                                                                                   max_features=150))]),\n",
       "                                                  &#x27;reviewText&#x27;)])),\n",
       "                (&#x27;decision_tree&#x27;,\n",
       "                 RandomForestClassifier(max_depth=10, min_samples_leaf=15,\n",
       "                                        n_estimators=150))])</pre><b>In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. <br />On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.</b></div><div class=\"sk-container\" hidden><div class=\"sk-item sk-dashed-wrapped\"><div class=\"sk-label-container\"><div class=\"sk-label fitted sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-19\" type=\"checkbox\" ><label for=\"sk-estimator-id-19\" class=\"sk-toggleable__label fitted sk-toggleable__label-arrow fitted\">&nbsp;&nbsp;Pipeline<a class=\"sk-estimator-doc-link fitted\" rel=\"noreferrer\" target=\"_blank\" href=\"https://scikit-learn.org/1.4/modules/generated/sklearn.pipeline.Pipeline.html\">?<span>Documentation for Pipeline</span></a><span class=\"sk-estimator-doc-link fitted\">i<span>Fitted</span></span></label><div class=\"sk-toggleable__content fitted\"><pre>Pipeline(steps=[(&#x27;data_preprocessing&#x27;,\n",
       "                 ColumnTransformer(transformers=[(&#x27;numerical_pre&#x27;,\n",
       "                                                  Pipeline(steps=[(&#x27;num_scaler&#x27;,\n",
       "                                                                   MinMaxScaler())]),\n",
       "                                                  [&#x27;time&#x27;, &#x27;log_votes&#x27;]),\n",
       "                                                 (&#x27;text_pre_0&#x27;,\n",
       "                                                  Pipeline(steps=[(&#x27;text_vect_0&#x27;,\n",
       "                                                                   CountVectorizer(binary=True,\n",
       "                                                                                   max_features=50))]),\n",
       "                                                  &#x27;summary&#x27;),\n",
       "                                                 (&#x27;text_pre_1&#x27;,\n",
       "                                                  Pipeline(steps=[(&#x27;text_vect_1&#x27;,\n",
       "                                                                   CountVectorizer(binary=True,\n",
       "                                                                                   max_features=150))]),\n",
       "                                                  &#x27;reviewText&#x27;)])),\n",
       "                (&#x27;decision_tree&#x27;,\n",
       "                 RandomForestClassifier(max_depth=10, min_samples_leaf=15,\n",
       "                                        n_estimators=150))])</pre></div> </div></div><div class=\"sk-serial\"><div class=\"sk-item sk-dashed-wrapped\"><div class=\"sk-label-container\"><div class=\"sk-label fitted sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-20\" type=\"checkbox\" ><label for=\"sk-estimator-id-20\" class=\"sk-toggleable__label fitted sk-toggleable__label-arrow fitted\">&nbsp;data_preprocessing: ColumnTransformer<a class=\"sk-estimator-doc-link fitted\" rel=\"noreferrer\" target=\"_blank\" href=\"https://scikit-learn.org/1.4/modules/generated/sklearn.compose.ColumnTransformer.html\">?<span>Documentation for data_preprocessing: ColumnTransformer</span></a></label><div class=\"sk-toggleable__content fitted\"><pre>ColumnTransformer(transformers=[(&#x27;numerical_pre&#x27;,\n",
       "                                 Pipeline(steps=[(&#x27;num_scaler&#x27;,\n",
       "                                                  MinMaxScaler())]),\n",
       "                                 [&#x27;time&#x27;, &#x27;log_votes&#x27;]),\n",
       "                                (&#x27;text_pre_0&#x27;,\n",
       "                                 Pipeline(steps=[(&#x27;text_vect_0&#x27;,\n",
       "                                                  CountVectorizer(binary=True,\n",
       "                                                                  max_features=50))]),\n",
       "                                 &#x27;summary&#x27;),\n",
       "                                (&#x27;text_pre_1&#x27;,\n",
       "                                 Pipeline(steps=[(&#x27;text_vect_1&#x27;,\n",
       "                                                  CountVectorizer(binary=True,\n",
       "                                                                  max_features=150))]),\n",
       "                                 &#x27;reviewText&#x27;)])</pre></div> </div></div><div class=\"sk-parallel\"><div class=\"sk-parallel-item\"><div class=\"sk-item\"><div class=\"sk-label-container\"><div class=\"sk-label fitted sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-21\" type=\"checkbox\" ><label for=\"sk-estimator-id-21\" class=\"sk-toggleable__label fitted sk-toggleable__label-arrow fitted\">numerical_pre</label><div class=\"sk-toggleable__content fitted\"><pre>[&#x27;time&#x27;, &#x27;log_votes&#x27;]</pre></div> </div></div><div class=\"sk-serial\"><div class=\"sk-item\"><div class=\"sk-serial\"><div class=\"sk-item\"><div class=\"sk-estimator fitted sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-22\" type=\"checkbox\" ><label for=\"sk-estimator-id-22\" class=\"sk-toggleable__label fitted sk-toggleable__label-arrow fitted\">&nbsp;MinMaxScaler<a class=\"sk-estimator-doc-link fitted\" rel=\"noreferrer\" target=\"_blank\" href=\"https://scikit-learn.org/1.4/modules/generated/sklearn.preprocessing.MinMaxScaler.html\">?<span>Documentation for MinMaxScaler</span></a></label><div class=\"sk-toggleable__content fitted\"><pre>MinMaxScaler()</pre></div> </div></div></div></div></div></div></div><div class=\"sk-parallel-item\"><div class=\"sk-item\"><div class=\"sk-label-container\"><div class=\"sk-label fitted sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-23\" type=\"checkbox\" ><label for=\"sk-estimator-id-23\" class=\"sk-toggleable__label fitted sk-toggleable__label-arrow fitted\">text_pre_0</label><div class=\"sk-toggleable__content fitted\"><pre>summary</pre></div> </div></div><div class=\"sk-serial\"><div class=\"sk-item\"><div class=\"sk-serial\"><div class=\"sk-item\"><div class=\"sk-estimator fitted sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-24\" type=\"checkbox\" ><label for=\"sk-estimator-id-24\" class=\"sk-toggleable__label fitted sk-toggleable__label-arrow fitted\">&nbsp;CountVectorizer<a class=\"sk-estimator-doc-link fitted\" rel=\"noreferrer\" target=\"_blank\" href=\"https://scikit-learn.org/1.4/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html\">?<span>Documentation for CountVectorizer</span></a></label><div class=\"sk-toggleable__content fitted\"><pre>CountVectorizer(binary=True, max_features=50)</pre></div> </div></div></div></div></div></div></div><div class=\"sk-parallel-item\"><div class=\"sk-item\"><div class=\"sk-label-container\"><div class=\"sk-label fitted sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-25\" type=\"checkbox\" ><label for=\"sk-estimator-id-25\" class=\"sk-toggleable__label fitted sk-toggleable__label-arrow fitted\">text_pre_1</label><div class=\"sk-toggleable__content fitted\"><pre>reviewText</pre></div> </div></div><div class=\"sk-serial\"><div class=\"sk-item\"><div class=\"sk-serial\"><div class=\"sk-item\"><div class=\"sk-estimator fitted sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-26\" type=\"checkbox\" ><label for=\"sk-estimator-id-26\" class=\"sk-toggleable__label fitted sk-toggleable__label-arrow fitted\">&nbsp;CountVectorizer<a class=\"sk-estimator-doc-link fitted\" rel=\"noreferrer\" target=\"_blank\" href=\"https://scikit-learn.org/1.4/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html\">?<span>Documentation for CountVectorizer</span></a></label><div class=\"sk-toggleable__content fitted\"><pre>CountVectorizer(binary=True, max_features=150)</pre></div> </div></div></div></div></div></div></div></div></div><div class=\"sk-item\"><div class=\"sk-estimator fitted sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-27\" type=\"checkbox\" ><label for=\"sk-estimator-id-27\" class=\"sk-toggleable__label fitted sk-toggleable__label-arrow fitted\">&nbsp;RandomForestClassifier<a class=\"sk-estimator-doc-link fitted\" rel=\"noreferrer\" target=\"_blank\" href=\"https://scikit-learn.org/1.4/modules/generated/sklearn.ensemble.RandomForestClassifier.html\">?<span>Documentation for RandomForestClassifier</span></a></label><div class=\"sk-toggleable__content fitted\"><pre>RandomForestClassifier(max_depth=10, min_samples_leaf=15, n_estimators=150)</pre></div> </div></div></div></div></div></div>"
      ],
      "text/plain": [
       "Pipeline(steps=[('data_preprocessing',\n",
       "                 ColumnTransformer(transformers=[('numerical_pre',\n",
       "                                                  Pipeline(steps=[('num_scaler',\n",
       "                                                                   MinMaxScaler())]),\n",
       "                                                  ['time', 'log_votes']),\n",
       "                                                 ('text_pre_0',\n",
       "                                                  Pipeline(steps=[('text_vect_0',\n",
       "                                                                   CountVectorizer(binary=True,\n",
       "                                                                                   max_features=50))]),\n",
       "                                                  'summary'),\n",
       "                                                 ('text_pre_1',\n",
       "                                                  Pipeline(steps=[('text_vect_1',\n",
       "                                                                   CountVectorizer(binary=True,\n",
       "                                                                                   max_features=150))]),\n",
       "                                                  'reviewText')])),\n",
       "                ('decision_tree',\n",
       "                 RandomForestClassifier(max_depth=10, min_samples_leaf=15,\n",
       "                                        n_estimators=150))])"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.ensemble import RandomForestClassifier\n",
    "\n",
    "pipeline = Pipeline([\n",
    "    ('data_preprocessing', data_preprocessor),\n",
    "    ('decision_tree', RandomForestClassifier(n_estimators=150,\n",
    "                                             max_depth = 10,\n",
    "                                             min_samples_leaf = 15))\n",
    "])\n",
    "\n",
    "# Fit the Pipeline to training data\n",
    "pipeline.fit(X_train, y_train.values)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's get the predictions on our validation data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[1681  924]\n",
      " [ 356 4039]]\n",
      "              precision    recall  f1-score   support\n",
      "\n",
      "         0.0       0.83      0.65      0.72      2605\n",
      "         1.0       0.81      0.92      0.86      4395\n",
      "\n",
      "    accuracy                           0.82      7000\n",
      "   macro avg       0.82      0.78      0.79      7000\n",
      "weighted avg       0.82      0.82      0.81      7000\n",
      "\n",
      "Accuracy (validation): 0.8171428571428572\n"
     ]
    }
   ],
   "source": [
    "from sklearn.metrics import confusion_matrix, classification_report, accuracy_score\n",
    "\n",
    "# Use the fitted pipeline to make predictions on the validation dataset\n",
    "val_predictions = pipeline.predict(X_val)\n",
    "print(confusion_matrix(y_val.values, val_predictions))\n",
    "print(classification_report(y_val.values, val_predictions))\n",
    "print(\"Accuracy (validation):\", accuracy_score(y_val.values, val_predictions))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 9. <a name=\"9\">Hyperparameter Tuning</a>\n",
    "(<a href=\"#0\">Go to top</a>)\n",
    "\n",
    "Let's try different parameter values and see how the __DecisionTreeClassifier__ model performs under some combinations of parameters.\n",
    "\n",
    "__Warning__: The number of hyperparameters tuned, along with the cross-validations, can greatly increase training time! Especially if trying hyperparameters tuning on the __RandomForestClassifier__ instead of the lower performing __DecisionTreeClassifier__ that we showcase below for speed! Similar tuning on a __RandomForestClassifier__ model can take more minutes to hours!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 9.1 GridSearchCV\n",
    "\n",
    "Find more details on the __GridSearchCV__ here:\n",
    "https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Fitting 5 folds for each of 6 candidates, totalling 30 fits\n",
      "Best parameters:  {'decision_tree__max_depth': 30, 'decision_tree__min_samples_leaf': 5}\n",
      "Best score:  0.8423015873015874\n"
     ]
    }
   ],
   "source": [
    "from sklearn.model_selection import GridSearchCV\n",
    "\n",
    "### PIPELINE GRID_SEARCH ###\n",
    "############################\n",
    "\n",
    "# Parameter grid for GridSearch\n",
    "param_grid={'decision_tree__max_depth': [10, 20, 30],#, 15, 25, 35, 45, 55, 75], \n",
    "            'decision_tree__min_samples_leaf': [5, 10],#, 15, 30],\n",
    "           }\n",
    "\n",
    "grid_search = GridSearchCV(pipeline, # Base model\n",
    "                           param_grid, # Parameters to try\n",
    "                           cv = 5, # Apply 5-fold cross validation\n",
    "                           verbose = 1, # Print summary\n",
    "                           n_jobs = -1 # Use all available processors\n",
    "                          )\n",
    "\n",
    "# Fit the GridSearch to our training data\n",
    "grid_search.fit(X_train, y_train)\n",
    "\n",
    "print(\"Best parameters: \", grid_search.best_params_)\n",
    "print(\"Best score: \", grid_search.best_score_)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[1982  623]\n",
      " [ 481 3914]]\n",
      "              precision    recall  f1-score   support\n",
      "\n",
      "         0.0       0.80      0.76      0.78      2605\n",
      "         1.0       0.86      0.89      0.88      4395\n",
      "\n",
      "    accuracy                           0.84      7000\n",
      "   macro avg       0.83      0.83      0.83      7000\n",
      "weighted avg       0.84      0.84      0.84      7000\n",
      "\n",
      "Accuracy (validation): 0.8422857142857143\n"
     ]
    }
   ],
   "source": [
    "from sklearn.metrics import confusion_matrix, classification_report, accuracy_score\n",
    "\n",
    "# Use the fitted pipeline to make predictions on the validation dataset\n",
    "val_predictions = grid_search.best_estimator_.predict(X_val)\n",
    "print(confusion_matrix(y_val.values, val_predictions))\n",
    "print(classification_report(y_val.values, val_predictions))\n",
    "print(\"Accuracy (validation):\", accuracy_score(y_val.values, val_predictions))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 8.2 RandomizedSearchCV\n",
    "\n",
    "Find more details on the __RandomizedSearchCV__ here:\n",
    "https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Fitting 2 folds for each of 6 candidates, totalling 12 fits\n",
      "Best parameters:  {'decision_tree__min_samples_leaf': 5, 'decision_tree__max_depth': 30}\n",
      "Best score:  0.8405396825396825\n"
     ]
    }
   ],
   "source": [
    "from sklearn.model_selection import RandomizedSearchCV\n",
    "\n",
    "# Parameter grid for GridSearch\n",
    "param_grid={'decision_tree__max_depth': [10, 20, 30],#, 15, 25, 35, 45, 55, 75], \n",
    "            'decision_tree__min_samples_leaf': [5, 10],#, 15, 30],\n",
    "           }\n",
    "\n",
    "random_search = RandomizedSearchCV(pipeline, # Base model\n",
    "                                 param_grid, # Parameters to try\n",
    "                                 cv = 2, # Apply 2-fold cross validation\n",
    "                                 verbose = 1, # Print summary\n",
    "                                 n_iter=6, # Test 6 randomly chosen sets of hyperparams\n",
    "                                 n_jobs = -1 # Use all available processors\n",
    "                                )\n",
    "\n",
    "# Fit the GridSearch to our training data\n",
    "random_search.fit(X_train, y_train)\n",
    "\n",
    "print(\"Best parameters: \", random_search.best_params_)\n",
    "print(\"Best score: \", random_search.best_score_)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[1979  626]\n",
      " [ 463 3932]]\n",
      "              precision    recall  f1-score   support\n",
      "\n",
      "         0.0       0.81      0.76      0.78      2605\n",
      "         1.0       0.86      0.89      0.88      4395\n",
      "\n",
      "    accuracy                           0.84      7000\n",
      "   macro avg       0.84      0.83      0.83      7000\n",
      "weighted avg       0.84      0.84      0.84      7000\n",
      "\n",
      "Accuracy (validation): 0.8444285714285714\n"
     ]
    }
   ],
   "source": [
    "from sklearn.metrics import confusion_matrix, classification_report, accuracy_score\n",
    "\n",
    "# Use the fitted pipeline to make predictions on the validation dataset\n",
    "val_predictions = random_search.best_estimator_.predict(X_val)\n",
    "print(confusion_matrix(y_val.values, val_predictions))\n",
    "print(classification_report(y_val.values, val_predictions))\n",
    "print(\"Accuracy (validation):\", accuracy_score(y_val.values, val_predictions))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 10. <a name=\"10\">Ideas for improvement</a>\n",
    "(<a href=\"#0\">Go to top</a>)\n",
    "\n",
    "**Preprocessing**: We can usually improve performance with some additional work. You can try the following:\n",
    "* Change the feature extractor to TF, TF-IDF. Also experiment with different vocabulary size.\n",
    "* Come up with some other features such as having certain punctuations, all-capitalized words or some words that might be useful in this problem.\n",
    "\n",
    "**Hyperparameter Tuning**: Always a good idea to try other parameter ranges and/or combinations of parameters. If training time is a priority, try __RandomizedSearchCV__ instead of __GridSearchCV__, it's much faster and with almost as good results. "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "sagemaker-distribution:Python",
   "language": "python",
   "name": "conda-env-sagemaker-distribution-py"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.14"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
